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A  TUTORIAL  ON  EM-BASED  DENSITY  ESTIMATION 
WITH  HISTOGRAM  INTENSITY  DATA 


1.  INTRODUCTION 

Observed  measurements  from  real-world  physical  systems  are  inherently  stochastic 
because  of  noise  in  the  propagation  medium  and  measurement  system,  and  because 
of  random  variabilities  in  the  source-generating  mechanism.  Therefore,  the  essential 
problem  in  estimation  is  not  to  identify  the  “true  value”  of  a  variable  of  interest  (such 
as  spatial  location  or  frequency),  but  to  accurately  characterize  the  probability  density 
function  (PDF)  associated  with  that  variable  of  interest.  In  most  real-world  situations, 
there  are  no  such  things  as  numbers ;  there  are  only  distributions.  This  report  is  about 
the  characterization  of  those  distributions. 

The  notion  that  data  collection  is  an  exercise  in  distributions  is  consistent  with 
the  way  observations  of  a  physical  variable  are  actually  obtained.  That  is,  observations 
from  digital  processing  systems  are  usually  obtained  by  partitioning  the  range  of  the 
physical  variable  into  bins ,  and  then  measuring  the  energy  that  falls  within  each  bin 
(or.  equivalently,  the  energy  intensity  associated  with  each  bin).  Characteristics  of  the 
variable  of  interest  are  then  inferred  from  the  energy  in  these  bins.  The  estimation  error 
in  this  inference  process  is  related  to  the  amount  of  probability  mass  (i.e.,  area  under 
the  PDF)  that  is  not  encapsulated  in  the  estimator.  Thus,  if  the  PDF  governing  the 
spread  of  energy  in  the  variable  of  interest  is  concentrated  enough  that  “nearly  all”  of 
the  probability  mass  falls  within  a  single  bin,  then  viewing  observed  data  as  point  mea¬ 
surements  (i.e.,  numbers)  is  a  reasonable  approximation.  In  situations  where  the  PDF 
has  probability  mass  extending  across  several  bins,  however,  such  point  approximations 
can  lead  to  significant  information  loss  and  often  to  bias. 

This  report  examines  histogram  estimation  methods  for  representing  intensity  data 
using  parameterized  PDF  models,  which  provide  a  mechanism  to  significantly  reduce 
the  information  loss  and  bias  that  result  from  point  approximations.  While  paramet¬ 
ric  density  estimation  has  a  long  history,  modern  treatments  of  the  problem  usually 
trace  back  to  the  seminal  paper  by  Dempster,  Laird,  and  Rubin  [1],  who,  among  other 
things,  applied  the  expectation-maximization  (EM)  algorithm  for  parameter  estima¬ 
tion  with  histogram  data.  McLachlan  and  Jones  [2]  extended  the  algorithm  in  1]  by 
applying  EM-based  histogram  estimation  to  static  mixture  densities.  Luginbuhl  [3], 
Luginbuhl  and  Willett  [4],  and  Streit  [5]  took  this  a  step  further  by  applying  histogram- 
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estimation  methods  for  dynamic  mixtures  within  the  probabilistic  multi-hypothesis 
tracking  (PMHT)  algorithm.  The  focus  here  is  on  the  static-mixture  histograms  dis¬ 
cussed  by  McLachlan  and  Jones  [2],  with  the  objective  of  providing  enough  detail  to 
allow  these  models  and  algorithms  to  be  applied  as  they  stand  (e.g.,  in  signal  classifi¬ 
cation  applications)  or  extended  for  dynamic  tracking  contexts  other  than  PMHT. 

The  organization  of  this  report  is  intended  to  develop  the  concepts  in  increasing 
detail,  ultimately  linking  all  aspects  of  histogram-based  algorithms  back  to  fundamental 
statistical  principles.  The  next  section  gives  a  high-level  overview  of  histogram  model¬ 
ing,  introducing  the  dominant  issues  and  motivations.  Sections  3  and  4  then  provide 
detailed  mathematical  developments  of  the  algorithms.  In  particular,  histogram  meth¬ 
ods  for  non-mixture  distributions  are  developed  in  section  3,  and  these  are  extended 
to  mixture  distributions  in  section  4.  After  a  brief  summary  in  section  5,  a  set  of  ap¬ 
pendixes  discuss  supporting  developments  from  optimization  and  distribution  theory.  In 
addition  to  reviewing  material  that  is  important  to  understanding  the  algorithms,  these 
appendixes  introduce  much  of  the  notation  used  in  the  body  of  the  report.  For  exam¬ 
ple,  the  discussion  of  the  EM  algorithm  in  appendix  A  provides  a  notational  and  logical 
template  that  is  used  repeatedly  when  discussing  histogram  estimation  algorithms  in 
sections  3  and  4. 
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2.  OVERVIEW  OF  HISTOGRAM  MODELING 


The  objective  is  to  define  statistical  models  that  characterize  the  behavior  of  vari¬ 
ables  of  interest  (such  as  bearing  or  frequency)  for  some  class  of  sources,  a  problem 
that  arises  in  a  number  of  applications.  For  example,  in  maximum-likelihood  classi¬ 
fication  (e.g.,  see  [6]),  PDF  models  are  use  to  represent  the  behavior  of  the  features 
under  various  class  hypotheses.  The  classifier  decision  boundaries  are  then  found  at 
certain  intersections  of  these  class-conditional  densities.  As  another  example,  proba¬ 
bilistic  tracking  algorithms  (e.g.,  see  [7])  estimate  parameters  in  a  dynamic  PDF  model 
at  each  time  step.  Regardless  of  the  application,  the  goal  when  developing  PDF  models 
is  usually  to  provide  a  “best  fit”  for  recorded  data.  However,  most  sensors  do  not  pro¬ 
vide  direct  measurements  of  the  variables  of  interest  (e.g.,  recorded  data  are  not  tagged 
with  location  or  frequency  information).  Information  about  the  desired  physical  vari¬ 
able  is  usually  derived  by  transforming  the  received  data  (e.g.,  beamforming  of  spatial 
array  data  or  Fourier  transform  of  time-domain  data).  An  inherent  characteristic  of 
this  transformation  process  is  a  partitioning  of  the  variable  domain  into  bins  and  the 
generation  of  output  values  that  correspond  to  these  bins.  These  transform  outputs  are 
usually  further  transformed  to  magnitude-squared,  or  energy  intensity ,  data  because 
the  original  transform  outputs  are  complex  quantities  whose  phase  is  very  sensitive  to 
noise.  Therefore,  PDF  models  are  derived  from  data  that  take  the  form  of  energy  as  a 
function  of  transform  bins  (e.g.,  energy  as  a  function  of  beam  or  frequency). 

The  traditional  approach  for  processing  this  type  of  intensity  data  extracts  “point 
observations”  of  the  physical  variable  using  a  peak  estimator ,  which  selects  one  or  more 
values  of  the  variable  for  which  the  intensity  data  are  locally  maximum.  While  simple 
interpolation  methods  can  significantly  improve  accuracy  when  signal  energy  extends 
over  just  a  few  bins,  this  approach  is  inappropriate  when  the  spread  in  the  energy 
extends  over  several  transform  bins.  Histogram  methods  accommodate  large  intensity 
spreads  by  employing  distribution  models  with  nonzero  second  (and  possibly  higher) 
moment  characteristics. 

This  section  introduces  the  basic  ideas  and  representation  abilities  of  the  histogram 
methods,  as  well  as  some  of  the  issues  that  are  addressed  in  the  later  sections.  The  first 
subsection  introduces  the  histogram  approximation.  The  second  discusses  issues  that 
arise  in  the  application  of  histogram  methods  to  acoustic  intensity  data.  The  third 
motivates  the  use  of  mixture  densities  in  a  histogram  context,  and  the  last  subsection 
discusses  issues  that  arise  when  the  range  of  available  measurements  does  not  fully  cover 
the  range  of  the  PDF. 
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2.1  HISTOGRAM  APPROXIMATION 


The  histogram  algorithms  discussed  here  fall  in  the  general  class  of  parametric 
statistical  modeling  techniques,  where  a  PDF  model  p( z;  O)  is  used  to  represent  the 
energy  distribution  in  the  physical  variable  z  of  interest.  The  characteristics  of  the 
model  are  governed  by  the  parametric  structure  of  the  model  and  by  the  values  in  the 
parameter  set  ©,  which  must  be  estimated  from  observed  data.  The  difficulty  with  this 
situation  is  that  there  are  no  direct  observations  of  z.  The  parameter  vector  0  must 
be  estimated  from  the  energy  intensity  data,  denoted  S  =  {s^  :  l  =  1, . . .  ,  L},  where 
S(  represents  the  energy  in  the  £th  bin  and  the  collection  of  bins  covers  some  range  of 
interest  in  the  values  of  z. 

A  natural  approach  for  overcoming  this  difficulty  is  to  transform  the  PDF  model 
p(z;  0)  into  another  valid  PDF  model  p(S;0),  allowing  0  to  be  estimated  from  the 
intensity  data.  When  a  convenient  mathematical  form  for  p(S;  0)  is  available,  then 
various  optimization  methods  can  be  applied  directly  (e.g.,  Newton  or  other  derivative- 
based  ascent  method).  Typically,  however,  the  task  of  expressing  S  directly  in  terms 
of  0  is  highly  nontrivial,  and  the  resulting  expression,  if  it  can  be  obtained,  is  highly 
nonlinear.  For  this  reason,  histogram  methods  employ  an  approximate  model  in  which 

.-•j  J  m. 

p( z:  0)  is  used  in  conjunction  with  a  multinomial  approximation  to  represent  the  within- 
bin  and  across-bin  characteristics  of  the  intensity  data.  Estimation  algorithms  are  then 
developed  using  an  iterative 'EM  approach.  A  brief  overview  of  the  EM  approach  is 
provided  in  appendix  A,  which  establishes  a  template  for  the  algorithm  descriptions  in 
sections  3  and  4. 

The  use  of  the  EM  algorithm  and  multinomial  approximation  invokes  a  quantum 
image  of  intensity  data,  analogous  to  photons  of  light  energy.  The  variable  z  of  interest 
is  considered  to  be  a  feature  of  the  quantum  particles,  and  the  particles  are  assumed 
to  be  sorted  into  bins  according  to  the  value  of  z  associated  with  each  particle.  The 
number  of  particles  in  each  bin,  denoted  m e  for  the  £th  bin,  is  referred  to  synonymously 
as  a  histogram  intensity ,  bin  intensity ,  histogram  count ,  particle  count ,  or  bin  count. 
The  symbol  me  is  used  to  emphasize  the  discrete  nature  of  the  histogram  intensities, 
in  contrast  to  real-valued  energy  intensities.  The  relationship  of  PDF  model,  point 
measurements,  and  histogram  intensities  is  illustrated  in  figure  1. 

For  sensing  modalities  in  which  particle  counts  are  actually  observed  (e.g.,  ionizing 
radiation),  or  when  there  exists  a  one-to-one  physical  correspondence  between  energy 
intensity  and  particle  count  me  (e.g.,  electromagnetic  energy),  the  histogram  approach 
is  generally  valid.  The  only  question  in  these  cases  concerns  the  appropriateness  of  the 
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Figure  1:  PDF,  Measurements,  and  Histogram  Intensities 
The  PDF  (top)  corresponds  to  a  hypothetical  scalar  distribution.  The 
points  measurements  (middle)  are  synthesized  by  sampling  from  the 
PDF,  and  these  measurements  are  assigned  to  the  various  histogram 
bins  with  boundaries  given  by  the  vertical  grid  lines.  The  numbers  of 
point  measurements  in  each  bin  correspond  to  the  histogram  bin  inten¬ 
sities  (bottom).  In  practice,  the  point  measurements  are  not  observed 
(indeed,  they  may  not  even  exist),  such  that  histogram  methods  must 
estimate  the  PDF  parameters  from  the  intensity  data.  EM-based  his¬ 
togram  methods  postulate  the  existence  of  the  point  measurements  in 
order  to  obtain  an  easier  optimization  problem.  These  point  measure¬ 
ments  are  treated  as  missing  data,  however,  and  are  marginalized  from 
the  objective  function  during  each  EM  iteration.  Note  that  the  point 
measurements  shown  in  this  figure  are  one- dimensional  in  the  horizon¬ 
tal  axis;  the  vertical  spread  in  the  measurements  is  included  merely  to 
illustrate  the  clustering  of  measurements. 
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parametric  form  for  p(z;@).  When  using  the  histogram  model  for  acoustic  intensity 
data,  however,  there  is  no  physical  mapping  from  to  rri(.  The  histogram  model  is 
therefore  inherently  mismatched  to  the  data  in  this  case. 

2.2  HISTOGRAM  APPROXIMATION  AND  ACOUSTIC  DATA 

When  applied  to  acoustic  intensity  data,  the  histogram  model  appears  to  exhibit 
some  insurmountable  flaws,  namely,  the  presumption  of  particles  and  point  measure¬ 
ments  that  don’t  really  exist,  as  well  as  the  real- valued  nature  of  the  energy  intensities 
in  a  theory  requiring  integer  count  data.  From  the  perspective  of  practical  implementa¬ 
tion,  there  are  two  factors  that  ameliorate  these  problems.  First,  the  actual  estimators 
for  the  model  parameters  in  ©  are  obtained  by  integrating  over  the  space  of  the  point 
measurements,  whereby  the  point  measurements  are  marginalized  out  of  the  likelihood 
function.  This  marginalization  takes  place  during  the  derivation  of  the  estimators,  such 
that  the  point  measurement  z  never  appears  in  any  computational  algorithm.  Sec¬ 
ond,  the  parameter  estimators  end  up  being  functions  only  of  the  relative  intensities, 
which  are  the  individual  bin  intensities  divided  by  the  overall  intensity.  The  numeri¬ 
cal  algorithms  do  not  care  if  these  ratios  are  formed  from  integer- valued  count  data  or 
real-valued  energy  data,  since  the  ratio  is,  in  general,  real  in  both  cases. 

Now,  just  because  the  algorithm  can  be  applied  does  not  make  it  the  right  tool. 
It  is  therefore  of  interest  to  take  a  closer  look  at  a  common  special  case,  specifically 
magnitude-squared  discrete  Fourier  transform  (DFT)  data.  The  DFT  forms  the  inner 
product  of  recorded  time-domain  data  with  a  set  of  complex  sinusoidal  basis  functions. 
Since,  in  practice,  it  is  impossible  to  observe  an  infinite  duration  of  the  time-domain 
signal,  it  is  impossible  to  localize  the  energy  to  a  single  frequency  point.  The  DFT 
sample  therefore  represents  the  energy  “in  the  vicinity  of”  a  given  frequency,  such  that 
the  partitioning  of  the  frequency  domain  into  bins  is  a  natural  result  of  the  transform 
process  and  the  DFT  samples  correspond  to  the  energy  that  is  projected  into  each  bin. 
This  projection  of  energy'  into  DFT  bins  has  the  flavor  of  a  histogram  by  definition, 
although  the  histogram  model  takes  this  a  step  farther  by  assuming  a  quantization  of 
the  DFT  bin  energies  to  take  integer  values,  as  if  there  were  some  basic  unit  of  acoustic 
energy  (i.e. ,  the  acoustic  equivalent  of  Planck’s  constant)  and  the  quantized  integers 
represent  the  number  of  these  units  that  fall  within  each  DFT  bin.  This  quantization 
implies  a  set  of  “synthetic”  particles  for  each  bin,  and  the  number  of  these  synthetic 
particles  is  indeed  a  histogram  count,  even  if  it  does  not  correspond  to  any  known  physi¬ 
cal  entity.  The  multinomial  model  at  the  heart  of  histogram  methods  then  characterizes 
these  synthetic  count  data. 
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The  multinomial  model  starts  out  by  treating  histogram  count  data  as  statistically 
independent  Poisson  processes.  The  histogram  count  for  each  bin  is  then  modeled  using 
a  conditional  distribution,  where  the  conditioning  is  on  the  total  synthetic  intensity  in  all 
bins  (the  quantized  version  of  the  total  signal  energy).  Now,  the  total  synthetic  intensity 
is  merely  the  sum  of  the  individual  intensities,  and  a  sum  of  Poisson  processes  is  itself 
a  Poisson  process  whose  expected  value  is  the  sum  of  the  individual  expected  values 
[8].  Furthermore,  since  the  conditioning  process  is  equivalent  to  dividing  probability 
mass  functions,  the  multinomial  model  is  effectively  a  series  of  ratios  of  Poisson  mass 
functions.  Note  that,  while  the  synthetic  intensities  for  the  various  DFT  bins  are  initially 
represented  as  independent  processes,  the  bin  intensities  are  not  independent  in  the 
multinomial  model  because  of  the  conditioning  on  the  total  intensity.  That  is,  all  bins 
affect  all  other  bins  through  their  contribution  to  the  total  intensity.  Note  also  that 
modeling  the  quantized  intensities  with  a  multinomial  distribution  provides  a  statistical 
description  that  matches  the  first  moment  of  the  observed  data  but  does  not  match  any 
higher  moments. 

Given  the  subtleties  of  the  multinomial  approximation,  its  appropriateness  must 
be  evaluated  on  a  case-by-case  basis  for  different  applications.  That  said,  the  histogram 
approach  provides  a  way  forward  in  cases  where  the  alternatives  are  intractable.  For 
example,  an  attempt  to  directly  model  the  energy  intensities  would  require  a  joint 
multivariate  exponential  distribution  over  all  of  the  spectral  or  spatial  bins  (i.e.,  a 
joint  distribution  over  easily  thousands  of  variables,  and  more  as  resolution  increases). 
Estimation  in  such  high-dimensional  spaces  is  impossible  in  most  real-world  scenarios. 

In  many  applications,  the  quantization  of  the  bin  intensities  is  implicit  because  the 
algorithms  depend  only  on  the  relative  intensities.  However,  the  quantization  manifests 
itself  in  a  very  explicit  w*ay  in  Bayesian  contexts  where  a  prior  distribution  is  imposed. 
The  problem  involves  the  relative  weighting  of  the  prior  distribution  and  the  measure¬ 
ment  likelihoods  when  estimating  the  posterior  parameter  estimates.  Specifically,  the 
measurement  likelihoods  are  weighted  by  the  overall  intensity  of  the  bins  (i.e.,  the  total 
number  of  particles  in  all  bins).  The  more  “measurements'’  there  are,  the  more  the 
prior  distribution  is  discounted.  But  when  dealing  with  artificially  quantized  intensity 
data,  the  overall  intensity  depends  on  the  unit  energy  associated  with  each  particle, 
which  is  itself  a  function  of  the  quantization.  The  net  result  is  that  the  quantization 
unit  becomes  an  explicit  variable  in  the  algorithm,  making  the  relative  weighting  of  the 
prior  completely  dependent  on  an  algorithm  design  parameter.  If  a  coarse  quantization 
(i.e.,  a  large  particle  energy  unit)  is  assumed,  then  the  synthetic  particle  counts  will  be 
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low  and  the  prior  distribution  will  dominate.  If  a  very  fine  quantization  (i.e.,  a  small 
particle  energy  unit)  is  used,  then  the  apparent  number  of  particles  is  very  high  and 
the  data  overwhelm  the  prior.  Indeed,  in  the  limit  as  the  quantization  unit  approaches 
zero,  the  synthetic  particle  counts  become  infinite  and  the  estimator  effectively  ignores 
the  prior  distribution  altogether,  a  problem  that  was  observed  by  Streit  in  his  work  of 
histogram  PMHT  [5].  This  same  issue  will  arise  with  any  dynamic  mixture  scenario 
in  which  the  overall  intensity  varies  with  time,  which  includes  just  about  all  practical 
tracking  applications. 

As  a  final  note,  the  dependence  of  the  parameter  estimators  only  on  the  relative 
intensities  also  means  that  the  estimators  are  invariant  to  the  true  overall  intensity. 
The  degree  of  freedom  introduced  by  this  invariance  allows  the  same  model  to  represent 
large  variabilities  in  different  realizations  of  the  distribution.  To  illustrate  this,  figures 

2  and  3  show  examples  of  synthetic  histogram  data  with  different  values  of  signal-to- 
noise-ratio  (SNR)  and  overall  intensity.  In  figure  2,  plots  are  shown  with  the  same 
value  of  SNR,  but  with  different  values  of  the  overall  intensities.  In  contrast,  figure 

3  contains  plots  with  the  same  overall  intensity  but  with  varying  SNR.  While  there  is 
significant  variability  in  both  cases,  there  is  a  subtle  distinction.  Low  SNR  generally 
means  that  there  is  too  much  of  the  wrong  kind  of  data  (i.e.,  noise),  whereas  low  overall 
intensity  indicates  that  there  is  too  little  of  every  kind  of  data.  The  histogram  model 
accommodates  both  cases  equally  well. 

2.3  HISTOGRAM  METHODS  AND  MIXTURE  MODELS 

The  Gaussian-signal-in-uniform-noise  mixture  model  used  to  generate  the  above 
examples  was  chosen  because  of  the  basic  role  that  such  models  play  when  using  his- 

I 

togram  techniques.  Specifically,  mixture  models  provide  a  convenient  way  to  explicitly 
model  noise,  which  is  necessary  because  histogram  models  are  very  data  inclusive.  This 
is  most  easily  seen  by  contrasting  the  histogram  approach  to  an  estimator  that  selects 
the  single  maximal  peak.  When  such  a  peak  estimator  is  applied  to  data  with  highly 
concentrated  signal  energy  (e.g.,  narrowband  signal  spectra),  a  beneficial  side  effect  is 
provided  in  the  form  of  signal  cleaning.  That  is,  in  cases  where  the  signal  intensity  is 
largely  confined  to  a  single  bin  and  that  bin  is  correctly  identified,  most  of  the  wideband 
noise  is  eliminated  with  the  omitted  bins.  The  histogram  estimator,  on  the  other  hand, 
throws  away  nothing.  It  must  therefore  explicitly  account  for  the  noise  to  minimize 
noise-induced  errors. 
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Figure  2:  Histogram  Data  as  a  Function  of  Overall  Intensity 
Image  plots  of  histogram  intensity  data  are  synthesized  independently  at 
each  of  150  time  points,  with  intensities  at  each  time  point  obtained  by 
sampling  from  a  two- component  mixture  model  containing  a  Gaussian 
signal  in  uniform  noise.  Samples  are  sorted  into  unit-width  bins  cover¬ 
ing  the  range  [—10,10].  The  Gaussian  component  has  constant  variance 
a2  =  4  and  a  sinusoidally  time-varying  mean.  The  mixture  components 
have  constant  probabilities  ns  =  0.3  (for  signal)  and  nn  =  0.7  (for  noise). 
The  ratio  7r,s/ 7r n  is  the  (wide-band)  SNR.  Intensity  gram  plots  are  shown 
for  overall  intensities  corresponding  to  2500  point  measurements  per 
sample  time  (top),  250  measurements  per  sample  time  (middle),  and  25 
measurements  per  time  (bottom).  9 
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Figure  3:  Histogram  Data  as  a  Function  of  SNR 
Histogram  data  are  synthesized  using  the  same  model  as  in  figure  2. 
Here,  the  total  number  of  measurements  in  all  plots  is  K  —  250  (as  in 
the  middle  plot  figure  2).  The  signal-mode  assignment  probability  varies, 
with  values  irs  —  0.5  (top),  7 rs  =  0.3  (middle),  to  tts  =  0.1  (bottom),  giving 
high,  medium,  and  low  signal-to-noise  ratios,  respectively. 
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Finite  mixture  distributions,  the  topic  of  section  4,  are  ideally  suited  for  noise 
modeling  because  they  provide  a  natural  mechanism  for  dividing  the  energy  between 
a  number  of  component  distributions,  one  or  more  of  which  can  be  tailored  to  noise. 
The  “basic”  PDF  model  for  histogram  methods  is  therefore  the  Gaussian-signal-in- 
uniform-noise  model.  The  use  of  a  uniform  noise  distribution  assumes  that  the  data 
has  been  pre-whitened,  for  example,  using  a  spectral  or  spatial  normalizer.  If  it  is  more 
desirable  to  work  with  un-normalized  data,  however,  a  more  sophisticated  noise  model 
is  required  and  can  be  achieved  using  a  mixture  of  uniform  or  Gaussian  components. 
Multiple  mixture  modes  can  also  be  used  to  describe  the  signal  energy  itself,  say  for 
data  containing  energy  from  multiple  sources  of  interest  or  signals  whose  energy  in  the 
variable  of  interest  is  inherently  multi-modal. 

2.4  TRUNCATED  HISTOGRAM  DATA 

For  a  variety  of  reasons,  intensity  measurements  may  not  be  available  for  some 
histogram  bins.  This  can  happen  unintentionally,  for  example,  when  processing  spectral 
data  from  sensors  with  a  limited  frequency  range.  It  can  also  occur  intentionally,  as 
when  data  are  truncated  or  subsampled  to  reduce  communication  and/or  computational 
requirements,  or  to  isolate  a  phenomenon  of  interest.  Figure  4  shows  a  hypothetical 
example  of  truncated  histogram  data.  While  this  illustration  focuses  on  histogram  “edge 
effects”  where  the  missing  bin  intensities  are  on  the  outer  edges  of  the  distribution, 
missing  intensity  values  can  also  occur  in  the  “interior”  of  the  histogram,  say,  due 
to  unintentional  drop-outs  in  the  sensor  response  or  intentional  subsampling  of  the 
histogram  bins.  When  the  histogram  bins  do  not  fully  cover  the  range  of  the  PDF,  the 
histogram  and  its  data  are  called  truncated.  This  is  in  contrast  to  a  complete  histogram, 
whose  bins  do  cover  the  entire  range. 

If  complete-histogram  algorithms  are  applied  to  truncated-histogram  data,  then 
a  mismatch  exists  between  the  physical  world  (where  a  nonzero  intensity  is  impossible 
m  some  bins)  and  the  mathematical  world  (where  nonzero  intensities  are  possible,  but 
none  just  happened  to  be  observed  in  the  data  at  hand).  The  size  of  the  mismatch 
depends  on  the  probability  mass  under  the  density  function  in  the  truncated  regions. 
This  probability  mass  may  be  very  small,  allowing  the  issue  to  be  ignored  in  some 
applications.  However,  when  modeling  physical  phenomena  whose  energy  can  approach 
the  edges  of  the  observable  measurement  space,  or  when  sensor  malfunction  causes  lost 
sensitivity  in  some  interior  region,  then  the  probability  mass  in  the  truncated  region 
can  be  significant.  The  EM  algorithm  is  used  to  circumvent  this  problem  by  treating 
the  unobserved  intensities  as  missing  data. 
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Figure  4:  Truncated  Histogram  Data. 

This  figure  shows  data  from  the  same  scalar  distribution  as  in  figure 
1,  with  the  continuous  PDF  shown  on  top,  the  synthetic  point  measure¬ 
ments  shown  in  the  middle,  and  the  histogram  bin  intensities  shown  on 
bottom.  In  this  case,  however,  the  histogram  bins  do  not  cover  the  entire 
region  over  which  point  measurements  can  occur.  The  energy  associated 
with  measurements  falling  outside  of  the  observable  region  (i.e.,  outside 
of  the  bold  vertical  lines)  is  not  included  in  any  histogram  bin  and  is 
therefore  lost  in  the  histogram  model.  The  lost  data  are  accounted  for  in 
the  EM  estimation  algorithm  by  utilizing  an  augmented  set  of  missing 
data.  That  is,  the  EM  algorithm  treats  as  missing  data  the  intensity 
values  in  the  truncated  region,  in  addition  to  the  point  measurements 
that  form  the  missing  data  in  the  case  of  a  complete  histogram. 
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The  computational  algorithm  for  the  truncated  histogram  ends  up  being  a  simple 
modification  of  the  algorithm  for  the  corresponding  complete  histogram.  The  theoretical 
work  needed  to  develop  the  algorithm,  however,  has  a  subtlety  requiring  careful  analysis. 
In  particular,  it  is  impossible  to  know  the  total  intensity  (i.e.,  the  sum  of  the  bin 
intensities)  when  some  of  the  bin  intensities  were  not  recorded,  and  the  multinomial 
distribution  is  well  defined  only  when  the  overall  intensity  is  given.  The  uncertainty 
regarding  the  overall  intensity  requires  the  use  of  a  negative  binomial  distribution  and 
its  multivariate  extension,  the  negative  multinomial  distribution.  This  extension  is 
discussed  in  section  3  and  appendix  C.  Fortunately,  the  end  result  of  that  analysis  is  a 
simple  extrapolation  formula  for  obtaining  the  expected  values  of  the  missing  intensities. 
The  EM  auxiliary  function  for  the  truncated  histogram  is  then  a  linear  function  of 
the  missing  intensities,  such  that  the  maximization  step  in  each  iteration  of  the  EM 
algorithm  can  be  formulated  in  terms  of  the  complete  histogram.  The  algorithm  then 
operates  on  an  extended  data  set  in  which  the  observed  intensities  are  augmented  with 
expected  intensities  in  the  truncated  regions. 
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3.  ESTIMATION  FROM  HISTOGRAM  DATA 


The  remainder  of  this  report  focuses  on  the  mathematical  developments  related 


to  histogram  modeling  and  estimation.  In  all  that  follows,  integer  count  data  are  as¬ 
sumed  available  for  each  bin.  That  is,  the  issues  cited  in  the  previous  section  related  to 
converting  from  real-valued  energy  data  to  integer- valued  histogram  data  are  assumed 
to  have  been  dealt  with  appropriately. 

As  mentioned  in  the  previous  section,  one  key  distinction  among  histogram  meth¬ 
ods  involves  whether  or  not  the  histogram  bins  completely  cover  the  range  of  possible 
measurement  values  and  intensity  data  are  available  for  all  bins.  When  estimating  the 
parameter  set  0.  the  values  of  z  for  which  p( z ;  0)  is  well  defined  constitutes  the  mea¬ 
surement  space ,  denoted  Z.  If  data  are  available  for  histogram  bins  that  completely 
cover  Z ,  then  the  histogram  and  its  data  are  complete.  If  there  are  regions  of  Z  for 
which  histogram  data  are  lacking,  then  the  histogram  is  truncated.  This  section  first 
examines  complete  histograms  and  then  extends  those  algorithms  for  truncated  data. 
A  general  relationship  between  the  two  types  of  histograms  is  then  discussed. 

The  notation  used  in  this  section  and  the  logical  steps  in  deriving  algorithms 
closely  parallel  the  discussion  of  the  expectation-maximization  (EM)  method  in  aj>- 
pendix  A.  Note  that  EM  theory  defines  complete  data  generically  to  indicate  the  con¬ 
catenation  of  the  observed  arid  missing  data,  which  is  independent  of  whether  the  his¬ 
togram  bins  fully  cover  the  space  of  the  physical  variable.  To  distinguish  these  different 
types  of  complete  data,  data  from  a  complete  histogram  will  always  be  referred  to  as 
complete-histogram  data.  Any  reference  to  “complete  data”  without  the  “histogram” 
modifier  indicates  the  more  generic  notion  from  EM  theory. 

3.1  ESTIMATION  FROM  COMPLETE  HISTOGRAM  DATA 

A  complete  histogram  partitions  the  measurement  space  into  a  set  of  mutually 
exclusive  and  exhaustive  regions,  which  are  the  histogram  bins  (or  cells),  denoted  Z( 
for  (  —  1 , ....  L.  The  measurement  space  is  thus  decomposed  as 


L 


(1) 


where  the  non-overlapping  nature  of  the  Z(  is  implicit.  Consider  now  a  collection  of 
hypothetical  point  measurements  Z  =  {zj, . . . ,  z*-}.  The  number  of  these  measurements 
whose  value  falls  in  each  bin  comprise  the  histogram  intensity  data,  denoted 


(2) 
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where  the  subscript  C  indicates  complete  histogram  data.  The  overall  intensity  is  then 
defined  as 

L 

K  =  E  m< '  P) 

e=i 

Under  a  given  set  of  parameter  values,  the  unconditional  probability  that  a  measurement 
falls  in  the  £th  bin  is 

$*(©)=  [  dz  p(z  ;  0) ,  (4) 

JZ( 

which,  for  a  complete  histogram,  satisfies 

L 

£  ».(©)  =  1  ■  (5) 

e=  i 

Assuming  that  the  measurements  constitute  K  independent  draws  from  L  categories, 
the  bin  intensities  are  governed  by  the  multinomial  distribution  discussed  in  appendix  B; 
that  is, 

L 

p(Mc  ;  0)  =  c( Mc)  I]  {*<(©)}"“ ,  (6) 

e=\ 

where  c(Me)  is  the  multinomial  coefficient  defined  in  equation  (127).  Prom  the  stand¬ 
point  of  the  EM-based  algorithms  developed  in  this  section,  equation  (6)  is  the  observed 
data  likelihood  function  (ODLF)  since  it  characterizes  the  observed  histogram  counts. 

3.1.1  EM  Algorithm  with  Missing  Measurements 

Definition  of  Missing  Data  and  Auxiliary  Function.  The  observed  histogram 
intensities  provide  a  relative  measure  of  the  number  of  unobserved  point  measurements 
that  “fall  into”  each  bin.  The  missing  data  for  the  EM  algorithm  are  these  unobserved 
measurements.  Within  a  given  bin  (say,  the  £th),  the  measurements  are  denoted 

Z*  =  (zek  :  k=  1, . . .  ,m*  |,  (7) 

and  the  full  set  of  missing  data  contains  measurement  sets  for  all  bins  as 

z  =  (z,  :€=1 . i}.  (8) 

The  complete  data  are  the  concatenated  set  containing  the  observed  bin  intensities 
and  missing  measurements,  and  its  joint  distribution  p(  Z,  Me  ;  0)  is  the  complete  data 
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likelihood  function  (CDLF).  Given  the  CDLF  and  the  posterior  density  p(Z|Mc;©*) 
for  the  missing  data  (with  parameter  estimates  0*  from  the  previous  EM  iteration), 
the  auxiliary  function  is  formed  by  taking  the  conditional  expectation 


Qz(0:0*,Mc)  =  J^dZp(Z\Mc-.e*)  log  p(Z, Mc ; ©) , 


(9) 


where  Jz  dZ  is  the  marginalization  operator  for  the  missing  measurements.  This  marginal¬ 
ization  is  represented  by  the  sequence  of  (possibly  multivariate)  integrations  given  by 
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dZ*  —  i  I  dz\\ •  •  •  /  dzUni 
Izi  JZi 


dz 


LI  * 


I  dzi/mL 

JzL 


(10) 


which  can  be  expressed  using  the  shorthand  notation 

dzik 


L 


L  mg 

^  s  n  n 

e=i  k= i 


(11) 


Care  must  be  exercised  when  using  this  shorthand  notation  because  a  set  of  nested 
integrals  is  clearly  not  equal  to  a  product  of  single-variable  integrals.  However,  due  to 
the  non-overlapping  nature  of  the  integration  regions  (i.e.,  the  Zk),  the  cross  terms  in 
the  nested  integrals  are  zero,  such  that  equation  (10)  is  equivalent  to  equation  (11)  in 
this  case. 


CDLF  and  Posterior  Distribution.  With  missing  measurements,  it  is  perhaps 
easiest  to  start  with  the  posterior  distribution  of  the  missing  data  and  then  derive 
the  CDLF  from  it.  With  no  other  knowledge,  each  measurement  is  governed  by  the 
distribution  p( :  0).  Because  the  measurement  z«-  is  know'll  to  reside  in  the  £th  bin, 
however,  its  distribution  in  this  restricted  region  is 


p{ztk  |  Ze ;  0)  = 


p(z*fc ;  Q) 
*<(©) 


(12) 


The  intensities  provide  the  numbers  of  independent  measurements  in  each  bin.  Given 
these,  the  likelihood  of  the  group  of  measurements  is  just  the  product  of  the  likelihoods  of 
each  measurement.  The  posterior  of  the  measurements  given  the  intensities  is  therefore 

p(zimC;0)  =  nn*«i*;e)  =  n  n  -}•  <i3> 

e=i  k= i  #=i  k=i  K  ty  ’  } 

The  CDLF  is  derived  from  equations  (6)  and  (13)  by  applying  Bayes’  rule  and  canceling 
terms,  which  gives 


L  mg 

P( Z.Mc;0)  =  p(Mc;0)p(Z|Mc;0)  =  c(Mc)  jJ  p(z^  5  ©)  (14) 

^=1  k= 1 
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The  log-CDLF  is  then  given  by 


L  me 

logp(Z,Mc;0)  =  logc(Mc)  +  ^^  logp(ztt;0).  (15) 

£=1  k=  1 

Dropping  the  term  logc(Mc),  which  does  not  depend  on  0,  the  auxiliary  function 
becomes 

P  L  me 

Qz(0;0*,Mc)  =  /  dZp(Z|Mc;©*)^^logp(z,fc;©).  (16) 

e=  i  fc=i 


Conditional  Expectation.  Due  to  the  independence  of  the  measurements  and  the 
structure  of  the  posterior  distribution  in  equation  (13),  the  conditional  expectation 
operation  can  be  expressed  as 


J  dZ  p(Z|Mc  ;©’)  =  n  fl  {  X  .  (17) 


where,  noting  the  definition  of  4^(0*)  in  equation  (4),  each  component  in  this  expression 
satisfies 


1 


dzek  p(zfk ;  ©*)  =  1  • 


(18) 


When  the  conditional  expectation  operation  in  equation  (17)  is  applied  to  the  log  of  the 
CDLF,  it  applies  individually  to  each  of  the  log  terms  that  appear  after  the  summa¬ 
tions  in  equation  (16).  Furthermore,  the  components  in  the  conditional  expectation  all 
integrate  to  one,  except  for  those  in  which  the  indices  of  the  integration  variable  match 
the  indices  of  the  log  term  being  operated  on.  The  auxiliary  function  therefore  becomes 


L 

Qz(0;©*,Mc)  =  ^ 
e=\ 


1 

$7(0*"] 


dz(k  p( z(k  ;  0*) 


log  p( ztk  ;  0) . 


At  this  point,  the  indices  £  and  k  on  the  measurements  have  no  significance  because 
the  integration  is  carried  out  independently  for  each  summand  (i.e.,  the  integration 
is  “inside”  the  summation  operations),  and  because  there  is  no  actual  set  of  observed 
measurements  into  which  one  might  have  to  index.  The  indices  £  and  k  are  therefore 
dropped  from  the  measurements  to  obtain  the  expression 


L  p 

Qz(©;@',Mc)  =  £  Jz  rfzp(z;©')  log  p(z ;  ©) . 


(19) 
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At  the  level  of  generality  considered  thus  far,  further  developments  are  impossible  be¬ 
cause  maximization  of  the  auxiliary  function  (the  M-step)  requires  a  particular  form  for 
p( z  ;  0).  The  M-step  is  carried  out  below,  however,  for  the  special  case  of  a  multivariate 
normal  density. 

M-Step  for  a  Gaussian  Model.  As  an  example,  consider  estimating  the  mean  vector 
jj,  and  covariance  matrix  P  in  the  Gaussian  distribution 


In  this  case,  the  auxiliary  function  given  in  equation  (19)  becomes 


where  p*  and  P*  are  the  existing  estimates  from  the  previous  EM  iteration.  The 
estimators  for  the  mean  and  covariance  are  given  in  terms  of  a  set  of  sufficient  statistics 
for  the  posterior  distribution.  In  addition  to  the  bin  probability 


(22) 


these  sufficient  statistics  include  the  centroid  and  spread  variables  for  each  histogram 
cell,  which  are  given  respectively  as 


(23) 


Jzt 


f2 ((&*)  =  [  dz  J\f(z:  /j,*,  P*)  zz1  .  (24) 

JZ( 

Computational  algorithms  for  equations  (22)-(24)  are  given  for  the  case  of  scalar  mea¬ 
surements  in  appendix  E.  The  estimators  for  the  mean  vector  and  covariance  matrix 
are  derived  in  appendix  D  and  are  given  by 


(25) 


(26) 


where  flp  is  the  center-shifted  second  local  moment,  which  is  computed  from  the  centroid 
and  spread  variables  as 


S\(0*,/x)  =  +  fi2  $,(©*). 


(27) 
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As  shown  in  appendix  D,  estimators  n  and  P  are  located  at  a  critical  point  of  the 
auxiliary  function,  and,  due  to  the  concave  structure  of  the  auxiliary  function,  they  also 
locate  the  global  maximum. 

3.2  ESTIMATION  FROM  TRUNCATED  HISTOGRAM  DATA 

When  histogram  intensity  data  for  various  bins  or  regions  are  unavailable  for 
some  reason,  then  the  estimator  needs  to  acknowledge  that  there  are  unknown  values 
in  those  bins  or  regions,  instead  of  assuming  that  they  are  zero.  The  EM  algorithm 
handily  addresses  this  problem  by  letting  the  unobservable  intensities  be  missing  data. 

To  establish  notation,  let  Z()  denote  the  observable  region  of  the  measurement 
space,  which  is  partitioned  into  Lo  histogram  cells  as 

L  O 

Z0  =  \JZt.  (28) 

e=i 

Further,  let  Zj  denote  the  truncated  region,  where  particles  are  not  counted  (e.g.,  at 
the  edge  of  an  image  or  at  either  end  of  a  frequency  band).  This  region  is  partitioned 
into  Lt  cells  as 

L 

ZT  =  |J  2, ,  (29) 

(=Lo + 1 

where  L  =  Lo  +  L t  is  the  total  number  of  cells  that  cover  the  concatenated  space 

Z  =  Z( )  U  Z'f  .  (30) 

The  unconditional  probability  of  a  particle  in  the  observable  region  is 

*0(0)  =  [  dzp(z:Q)  =  £  *,(©),  (31) 

J  Z0  £=1 

and  the  particle  probability  for  the  truncated  region  is 

*T(0)  =  f  dz  p(z;  0)  =  £  *,(0).  (32) 

e=L0+ i 

Since  Zq  and  Zj  satisfy  equation  (30),  *o(0)  and  *t ( 0 )  satisfy 

*o(0)  +  *t(0)  =  1  •  (33) 

The  collection  of  observed  intensities  is  denoted 
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M0  =  |  me :  £  =  1, . . .  ,L0  j 


(34) 


(35) 


and  the  total  number  of  observed  particles  is 

Lo 

Kq  =  ^  mi. 
(=  1 


Since  the  observed  intensities  represent  Kq  draws  on  Lo  categories,  Mo  is  multinomially 
distributed.  But  since  the  unconditional  probabilities  of  the  observed  histogram  bins 
do  not  sum  to  one,  the  ODLF  is  given  by 


f  1  ~K° 

{ *o(e) } 


n  { *<<©) }”' . 

^=1 


(36) 


The  E\1  algorithm  for  optimizing  this  likelihood  function  is  developed  in  two  stages. 
First,  the  unobserved  intensities  are  treated  as  the  sole  piece  of  missing  data.  This  is 
then  extended  to  the  case  of  missing  measurements  and  intensities. 


3.2.1  EM  Algorithm  with  Missing  Hit  Counts 

Consider,  for  a  moment,  estimating  the  distribution  parameters  using  an  EM 
algorithm  that  treats  the  truncated  intensities  as  missing  data  but  does  not  include 
the  measurements  in  the  missing  data.  The  resulting  algorithm  is  of  little  practical 
interest  since  the  M-step  likely  involves  a  difficult  nonlinear  problem.  This  situation  is 
considered,  however,  to  isolate  the  issues  involved  with  truncated  bin  intensities.  After 
getting  an  understanding  of  these  issues  in  the  present  subsection,  the  next  subsection 
takes  on  the  case  where  both  the  truncated  intensities  and  measurements  are  included 
in  the  missing  data. 

Definition  of  Missing  Data  and  Auxiliary  Function.  For  the  present  discussion, 
the  missing  data  are  the  intensities  in  the  truncated  region,  denoted 

Mp  =  |  mi  :  £  =  Lq  +  1, . . . ,  L  | ,  (37) 

and  the  auxiliary  function  is  defined  as 

QMt(0:0*.Mo)  =  Y,  P(MT|M0;  0*)  log  p(MT,  M0  ;  ©) .  (38) 

Mt 

The  marginalization  operator  for  the  missing  intensities  is  expressed  using  the  shorthand 
notation 

L  (  oo  ^  oo  oo  oo 

E  =  n  E  *  E  E  E  <») 

MT  £=Lq  +  1  l  rne=()  J  ™^0  +  i  — 0  mL0+ 2=0  mL—  0 
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Similar  to  the  notation  introduced  in  equation  (11),  this  shorthand  notation  is  only 
accurate  when  the  cross-terms  in  the  nested  summations  are  zero,  in  which  case  the 
nested  sum  does  indeed  reduce  to  a  product  of  individual  summations. 

CDLF  and  Posterior  Distribution.  The  posterior  distribution  of  the  missing  in¬ 
tensities  is  derived  in  appendix  C,  and  is  given  by 

K  l  m 

p(MT|Mo;0)  =  c~(Mt,K0)  {<f>o(0)}  J]  {$,(©)  p,  (40) 

t=Lo+l 

where  c“(Mt,  Kq)  is  the  negative  multinomial  coefficient  defined  in  equation  (140). 
The  CDLF  is  given  in  terms  of  the  ODLF  and  posterior  distribution  as 


p(Mt.  Mo  ;  0)  =  p(M(> ;  0)  p(Mx  |  Mo  ;  ©)  (41) 

L 

=  c( Mo)  c"( Mt,  Ko)  n  {  *<(©)  }mt  -  (42) 

£=1 

whose  logarithm  is 

L 

logp(MT,  M0  ;  ©)  =  log  c(M0)  +  logc_(MT,  Ko)  +  Y  me  log$*(0) .  (43) 

e=i 

Since  the  first  two  terms  in  equation  (43)  do  not  depend  0,  they  can  be  dropped  from 
the  auxiliary  function  to  obtain 

L 

Qmt(©:0*,Mo)  =  Y,  p(Mt|Mo;0*)  Y  loR^(0)-  (44) 

yVfx  £=1 


Conditional  Expectation.  The  expression  in  equation  (44)  is  a  linear  function  of  the 
missing  bin  intensities,  with  respect  to  which  the  expectation  is  being  taken.  Because 
of  this  linearity,  the  expected  value  of  the  log-CDLF  is  obtained  merely  by  substituting 
the  expected  values  of  the  missing  data,  leading  to  the  expression 

L 

(?mt(©:0*.Mo)  =  Y  log $*(©) ,  (45) 

e=i 

where  the  expected  intensities  are  defined  by 

\m(  if  £  =  1, . . . ,  L0 

me  =  {  (  ^  (46) 

[F{m,|Mo;0*}  if  t  =  LQ  +  1, . . . ,  L . 
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The  expected  intensities  for  the  truncated  cells  are  derived  in  appendix  C  and  are  given 
for  t  =  Lq  +  1, . . . ,  L  as 

E{m,|Mo;0-}  =  (47) 

This  expression  extrapolates  the  intensities  from  the  observed  region  into  the  truncated 
region  by  using  the  ratio  {  Ao/^ol©*)}  as  a  “probability-to-intensity”  conversion  fac¬ 
tor  and  then  applying  this  conversion  factor  to  the  unconditional  probability  in  each 
truncated  bin. 


3.2.2  EM  Algorithm  with  Missing  Hit  Counts  and  Measurements 

The  results  of  the  previous  subsection  are  now  extended  such  that  the  missing 
data  include  both  the  truncated  intensities  and  measurements. 


Definition  of  Missing  Data  and  Auxiliary  Function.  The  missing  data  are 
defined  by  the  sets 


My  = 


Z  = 


I  me :  t  =  Lq  +  1, .. .  ,L  j  , 
:£=!,...  ,L,  k  =  l,..., me  j  . 


(48) 


The  auxiliary  function  is  defined  by 

Qz,mt(0;0*,Mo)  =  V  [  dZ  p(Z. MT| Mo; ©*)  logp(Z, MT, M0 ; ©) ,  (49) 

mt  Jz 

where  the  marginalization  operators  for  the  missing  data  are  again  expressed  using  the 
shorthand  notation 


E 


L 


ri 


//z  -  n  n  { l  } 

JZ  e=i  k=i  '  JZt  > 


CDLF  and  Posterior  Distribution.  The  distribution  p(Z,Mt,Mo;0)  can  be 
factored  using  Bayes’  rule  as 

p(Z,  Mt.  Mo  ;  0)  =  p(Mq  ;  0)  p(My  |  Mo  ;  0)  p( Z  |  Mt,  Mo;  0) 

-  p(M0  ;  0)  p(Mt  |  Mo ;  0)  p(Z  |  Mc;  0) ,  (50) 
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where  p(Z|  Mx,  Mo;  0)  =  p(Z|  Mo;  ©)  because,  once  the  missing  intensities  are  given, 
the  concatenated  set  {Mo,  Mj}  corresponds  to  the  set  of  intensities  for  the  complete 
histogram.  Equation  (50)  serves  as  the  starting  point  both  for  obtaining  the  posterior 
distribution  of  the  missing  data  and  for  evaluating  the  CDLF  to  be  substituted  into 
equation  (49).  A  suitable  expression  for  this  latter  purpose  is  obtained  by  substituting 
equations  (36),  (40),  and  (13),  and  canceling  terms,  which  gives 

L  TTl£ 

p( Z,Mt,Mo;0)  =  c(Mo)  c-(MT,tfo)  nn  p(zfk:@).  (51) 

The  log  of  the  CDLF  is  therefore  given  by 


^=i  fc=i 


L  me 

logp(Z.MT,M0  ;0)  =  logc(Mo)  +  logc_(MT,  K0)  +  E  E  logp(zR.  ;0),  (52) 

e= i  fc=i 

such  that,  after  ignoring  constant  terms  logc(Mo)  and  logc“(Mx,  A'o),  the  auxiliary 
function  becomes 

r  L  rnt 

Qz,mt (©;©*,  Mo)  =  E  /  <*Zp(Z,MT|Mo;0*)  E  E  ^gp(zek]@).  (53) 


Mt 


=1  k=  1 


The  missing-data  posterior  distribution  is  obtained  by  dividing  equation  (50)  by  p(Mo  ;  0), 
which  gives 


p(Z,  MT  |  M0;  ©*)  =  p(Mt|Mo;©*)p(Z|Mc;0*). 


(54) 


Conditional  Expectation.  The  expectation  operation  in  equation  (53)  can  be  de¬ 
composed  as 

E  /  dZp(Z,MT|M0;©’)  =  Ep(Mt|Mo;©*)  f  dZ  p(Z|Mc;©*),  (55) 

Mt  ^  Mt  ^ 

allowing  the  auxiliary  function  to  be  expressed  as 
Qz.MT  (0:0*,  Mo)  =  Ep(mt|Mo;©*) 


Mt 


p  L  TUf 

x  /  dZp(Z|Mc;©-)  £X>gp(Z„;©).  (56) 


t=  1  fc=l 

Noting  equation  (16),  the  second  line  of  this  expression  is  just  the  auxiliary  function  for 
the  complete  histogram,  such  that 


Qz.mt(©:0,,Mo)  =  £p(Mt|Mo;0-)  Qz(0:0-,Mc) 


(57) 


Aiq 
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Substituting  the  final  form  for  Qz(©;©*\Mc)  given  in  equation  (19)  again  leaves  an 
expectation  of  a  function  that  is  linear  in  the  missing  intensities.  Paralleling  the  case 
in  which  only  the  intensities  are  missing,  the  auxiliary  function  simplifies  to 

L 


Qz.mt(8;  S',  M0)  =  Y'  <r,7ri'l  [  <lz  p{z\  @4 )  log  p(z:  @) , 

.  1  h, 


(58) 


h, 

where  fh(  is  defined  in  equation  (46).  Once  again  it  is  impossible  to  go  any  further 
without  imposing  a  form  on  p(x;0).  For  the  special  case  of  Gaussian  measurements, 
maximization  of  equation  (58)  over  the  mean  vector  n  and  covariance  matrix  P  yields 
estimates  that  are  equal  to  those  in  equation  (25)  and  equation  (26),  but  with  me  re¬ 
placed  with  rhe  and  K  replaced  with  K  =  The  Gaussian  parameter  estimates 

after  each  iteration  of  the  EM  algorithm  with  truncated  histogram  data  are  thus 

L 


= 


me 


1<  j-(  $,(©*) 


p  =  iy 


me 


«>*(©*), 


n«(©*,A), 


k  tr  *<(©*) 

where  u>((©*)  and  ilt(0*,n)  are  defined  in  equations  (23)  and  (27),  respectively. 


(59) 


(60) 


3.3  RELATIONSHIP  BETWEEN  COMPLETE  AND  TRUNCATED  HIS¬ 
TOGRAMS 

In  the  analysis  of  truncated  histograms,  the  auxiliary  function  with  missing  mea¬ 
surements  and  histogram  intensities  reduced  to  an  expected  version  of  the  auxiliary 
function  for  the  complete  histogram.  This  occured  because  of  the  similarity  of  the 
CDLF  for  the  two  situations.  It  is  useful  for  later  developments  with  mixtures  to  flesh 
out  this  equivalence  more  generally.  Consider  the  generic  problem  in  which  the  missing 
data  contain  some  set  of  variables  2.  For  a  complete  histogram,  the  CDLF  in  most 
cases  of  practical  interest  can  be  expressed  as 

p(2,Mc;©)  =  p(Mc;0)  p(H|Mc;0) 

L 

=  c( Mc)  I]  {*<(©)}"“  p(S|Mc;0).  (61) 

<=i 

The  CDLF  for  the  truncated  histogram,  on  the  other  hand,  is  given  by 
p(2,Mx,Mo;@)  =p(Mt.Mo;0)  p(E|Mc;0) 

L 

=  c(M0)  c"(Mt,  Ko)  n{^(©)f  p(S|Mc;0).  (62) 

t=i  ^ 
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With  the  exception  of  the  coefficients,  these  two  expressions  are  identical.  Substituting 
the  definitions  of  the  multinomial  and  negative  multinomial  coefficients  in  the  CDLF 
for  the  truncated  histogram  gives 


c( Mo)  c“(Mt,  Kq) 


K 


o 


(I\  —  1) ! 


{nil  ! }  ( Ko  ~  1) !  {nJUo+1  ! } 


Ko 

Ko 


K\ 


{nti  m^!} 

C(  Me) , 


(63) 


such  that  the  two  CDLFs  are  related  by  the  expression 

p( B,  Mt,  Mo  :  0)  =  p( 3,  Mc ;  ©) .  (64) 

This  expression  gives  an  indication  of  the  likelihood  “loss”  that  results  from  the  unob¬ 
servability  of  some  of  the  intensities.  This  relationship  between  the  CDLFs  results  in  a 
similar  relationship  between  the  auxiliary  functions  for  the  two  problems.  The  auxiliary 
function  for  the  truncated  histogram  can  be  written  as 

Qs,mt(©;©*,Mo)  =  £>(M t|Mo;©*)  f  dSp(E|Mc;0*)  log p(E, MT, M0 ; 0) 

Mt 

=  ]T>(  Mt|Mo;0*)  f  dE  p(S|Mc;  ©*) 

Mt 

x  { log  (7^)  +  losp(s’ Mc :  0) }  • 

The  random  variable  K  is  a  function  of  ©*,  but  not  of  0.  Since  Kq  is  observed,  the 
term  \og(Ko/ K)  is  independent  of  0.  and  the  auxiliary  function  effectively  becomes 


Qe,mt(0;0*,Mo)  =  X>(Mt|Mo:0*)  Qs(0;0*,Mc). 

M[t 

Furthermore,  the  parameter  estimators  for  the  complete  histogram  usually  depend  lin¬ 
early  on  me,  as  was  the  case  in  the  previous  subsection.  The  approach  used  above 
therefore  generalizes  to  most  cases  of  interest;  that  is,  the  truncated-histogram  esti¬ 
mators  are  obtained  by  substituting  the  expected  intensities  into  the  corresponding 
complete-histogram  estimators.  One  thus  never  need  explicitly  derive  the  estimator  for 
the  truncated  histogram,  so  long  as  the  complete-histogram  estimator  exhibits  this  lin¬ 
ear  dependence  on  the  cell  intensities.  Since  this  is  also  the  case  with  the  mixture  models 
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discussed  in  the  next  section,  the  results  are  formulated  for  complete  histograms  only. 
The  qualifiers  (i.e.,  subscripts  I,  C,  and  T)  on  histogram-related  variables  are  therefore 
dropped  in  the  next  section,  with  the  understanding  that  variable  rri(  in  a  given  ex¬ 
pression  is  an  observed  intensity  for  cells  in  Zq  and  an  expected  intensity  for  cells  in 

Z>T- 
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4.  HISTOGRAM  MIXTURE  MODELS 


This  section  discusses  parameter  estimation  for  finite  mixture  models,  which  in¬ 
troduce  another  form  of  missing  data,  namely  the  mode  assignments.  Subsection  1 
outlines  the  signal-in-noise  mixture  model;  subsection  2  reviews  estimation  from  point 
measurements,  which  illustrates  in  isolation  thp  issues  surrounding  mode  assignment 
uncertainty.  Subsection  3  then  generalizes  this  to  estimation  from  histogram  intensity 
data,  wherein  the  point  measurements  and  inodes  assignments  are  missing  data. 

4.1  MODEL  DEFINITION 

Let  the  distribution  of  interest  (i.e.,  the  “signal  distribution” )  be  denoted  ps( z:  Os). 
At  issue  is  the  fact  that,  in  noisy  data,  some  of  the  measurements  do  not  belong  to 
ps(z:Os),  but  instead  belong  to  some  extraneous  distribution  po(z\0o)  (he.,  the  “noise 
distribution”).  If  one  attempts  to  estimate  9  s  in  the  signal  distribution  without  account¬ 
ing  for  the  noise  measurements,  then  the  resulting  parameter  estimates  are  distorted. 
Noise  measurements  are  accommodated  using  the  two-mode  “signal-or-noise”  mixture 
distribution  given  by 

p( z  :  7r0, 0O,  Os)  =  tto  Po(z-  00)  +  (1  -  n0)  ps( z :  Os) ,  (65) 

where  7Tq  is  the  unconditional  probability  that  a  measurement  belongs  to  the  noise.  In 
general,  the  model  distributions  Ps(z',Os)  and  po(z;0n)  can  have  different  parametric 
structures  (e.g.,  Gaussian  signal  distribution  and  uniform  noise  distribution). 

From  this  simple  statement  of  the  model,  more  sophisticated  models  are  obtained 
by  letting  the  individual  signal  or  noise  components  themselves  be  mixtures,  resulting 
in  a  mixture  of  mixtures.  This  is  most  valuable  when  ps(z:0s)  and/or  po(z:#o)  are 
either  unknown  or  so  complicated  that  they  lead  to  intractable  estimation  problems. 
In  the  discussion  below,  a  separate  mixture  is  included  for  the  signal  distribution  only; 
a  mixture  model  for  the  noise  component  is  constructed  similarly.  Thus,  the  signal 
distribution  is  expressed  in  terms  of  modal  distributions  as 

j 

Ps{ z ;  Os)  =  ^3  Pj(z:  ei )  >  (66) 

j= i 

where  pj( z;  0j)  is  the  j\h  modal  distribution  and  is  the  unconditional  probability  that 
the  j th  mode  is  valid  (i.e.,  that  z  is  actually  governed  by  the  j tli  modal  distribution). 
The  collection  of  these  probabilities  satisfies  the  constraint  Y2j=i  tpj  =  1-  Substituting 
equation  (66)  into  equation  (65),  and  defining  7 ij  =  (1  —  7r0)  ipj  for  j  =  1, . . . ,  J,  results 
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(67) 


in  an  overall  mixture  distribution,  including  signal  and  noise  models,  given  by 

j 

p{ z;©)  =  Y  Pj(z:0i)- 

j= o 

This  model  is  parameterized  by  the  set  0  —  {7r,  do,  Os},  where  7r  —  {7To,  7Ti,  . . . ,  7Tj} 
is  the  collection  of  unconditional  mode  assignment  probabilities,  d0  contains  the  noise 
distribution  parameters,  and  Os  =  {d\,  ■  ■  ■  ■ Oj }  is  the  collection  of  parameters  for  all 
modes  of  the  signal  distribution.  The  right-hand  side  of  equation  (67)  is  a  convex 
combination  since  %2j=o  =  !• 

4.2  ESTIMATION  FROM  POINT  MEASUREMENTS 


The  assignment  uncertainty  issue  is  the  defining  characteristic  of  finite  mixture 
densities.  This  issue  is  now  considered  in  isolation  by  considering  the  problem  in  which 
a  set  Z  of  independent  point  measurements  are  observed.  While  the  resulting  algo¬ 
rithm  is  of  value  in  certain  cases  in  which  feature  values  of  observed  data  samples  are 
directly  observed  (e.g.,  see  [9]),  this  is  a  rather  artificial  problem  in  the  context  of 
histogram  modeling  and  is  presented  merely  as  a  stepping  stone  to  the  more  general 
problem  description  in  the  next  subsection.  In  any  case,  when  observations  of  the  point 
measurements  are  assumed  to  be  available,  an  optimal  parameter  estimation  algorithm 
maximizes  the  ODLF  given  by 


Direct  optimization  of  this  likelihood  function  with  respect  to  0  is  generally  difficult 
because  of  the  interaction  between  mixture  modes. 


P(  Z;0)  =  1 


Y  Pik(zk’0h) 


(68) 


h= i 


4-2.1  EM  Algorithm  with  Missing  Assignments 

The  EM  algorithm  for  mixtures  circumvents  the  mode  interaction  issue  by  treating 
as  missing  data  the  mode  assignments  (i.e.,  the  jk).  The  idea  behind  this  choice  is 
that,  if  the  mode  assignment  were  known  for  each  measurement  (i.e.,  if  the  component 
distribution  actually  governing  each  measurement  were  known),  then  the  measurements 
could  be  grouped  according  to  mode  and  the  parameter  estimation  problem  would 
decompose  into  J  independent  problems  that  are  easier  to  solve.  The  EM  algorithm 
does  induce  such  a  decomposition,  albeit  in  each  iteration  of  an  iterative  procedure. 

Definition  of  Missing  Data  and  Auxiliary  Function.  The  set  of  missing  data 
corresponding  to  the  observed  measurements  in  Z  is  denoted  as 

J  =  |  ji,h,  ■  ■  ■ , 3k  | ,  (69) 
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such  that  the  auxiliary  function  is  given  by 


gj(0;0*,Z)  =  ^2  p(  J|Z:0*)  log  p(Z,  J;  0) . 

J 


(70) 


Given  that  the  measurements  are  independent,  the  joint  density  is  a  product  of  terms 
that  allows  the  missing-data  marginalization  operation  to  be  expressed  using  the  short¬ 
hand  notation 


K 


E-n  El- 

J  k= 1  l  jk= 1 


(71) 


CDLF  and  Posterior  Distribution.  The  CDLF  is  obtained  by  noting  the  in¬ 
dependence  of  the  measurements  and  by  applying  Bayes’  rule  on  a  measurement-by- 
measurement  basis,  which  gives 

K  K 

P( J.Z;©)  =  Yl  p{jk:Q)  p(zk\jk;&)  =  H  irjk  pjk (zfc; 0jk) ,  (72) 

k=  1  k—1 

where  p(jk\ 0)  =  njk  and  p(zk\jk]  ©)  =  pJk(zk;0Jk).  The  log-CDLF  is  therefore 

K 


iogp(J. z ; ©)  =  J2  { log 71  j* +  eh) } 


(73) 


k- 1 

Bayes’  rule  then  gives  the  posterior  distribution  as  to  obtain 

K 


„/T \rj  r^*\  _  P( J.Z;0*)  _  -p-j- 

P(  J|  Z  ,  0  )  p(Z;0*)  II  7^(0  )’ 


(74) 


k= 1 


where  7 fejfc(©*)  =  pOklz*;©*)  is  the  single-measurement  posterior  assignment  proba¬ 
bility,  which  is  defined  (and  computed)  as 

*5,  PjM'O'jJ 


7^(0')  = 


{Ei. 


(75) 


For  all  values  of  k,  these  posterior  probabilities  satisfy 

j 

£  w©*)  =  i. 


(76) 


3k  =  1 


Conditional  Expectation.  Given  the  structure  of  equation  (74),  the  conditional 
expectation  in  equation  (70)  is  a  sequence  of  single-variable  expectation  operations, 
which  is  expressed  as  a  product  of  operators  as 


K 


£>(j|z;e*)  =  J]  y>«(e-) 

J  k= 1  l  jk= 1 


(77) 
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If  evaluated  as  it  stands,  then  all  of  the  single-variable  expectations  in  this  expression 
sum  to  one  (hence  the  overall  expression  is  one).  Similarly,  when  the  conditional- 
expectation  operator  is  applied  to  the  log-CDLF  in  equation  (73),  all  of  the  single¬ 
variable  expectations  sum  to  one,  except  those  for  which  the  marginalization  index 
matches  the  index  jk  appearing  in  the  log-CDLF.  The  auxiliary  function  therefore  re¬ 
duces  to 

K  J 

Qj(0;  ©*,  Z)  =  £  £  7 kjk(&)  {  logTr^  +  log pjk(zk]  0jk)  }  .  (78) 

fc=l  jk= 1 

At  this  point,  the  index  k  on  the  assignment  variable  adds  no  meaning  since  the  sum¬ 
mation  covers  all  possible  values  of  the  assignment.  Furthermore,  since  jk  is  missing 
data,  there  is  no  actual  “observed  value”  for  jk  that  is  tied  to  a  particular  measurement 
zfe.  This  index  is  therefore  dropped  to  obtain 

K  J 

Qj(&:  ©*,  Z)  =  22  227kj(&*)  (logTTj  +  logpj(zk;0j)J  .  (79) 

fc=i  j= i 

Without  the  /^-dependence,  the  summation  over  j  can  be  moved  in  front  of  the  summa¬ 
tion  over  k,  and  the  auxiliary  function  can  be  decomposed  into  a  collection  of  component 
auxiliary  functions,  each  of  which  depends  on  a  separate  subset  of  the  model  parameters. 
This  decomposition  is  defined  as 


where 


Qj(0:  ©*,  Z)  =  J2  ©*,  Z)  +  22  Qj(0j ;  0*.  Z) , 


i= i 


Qj(7T,;©*,Z)  =  22  7 log TTj,  , 


Ar=l 


(80) 


(81) 


K 

Q3(0j-  0%  Z )  =  22  7 *,-(©•)  logp^z*;  0j) .  (82) 

fe=i 

Given  the  mutually  exclusive  nature  of  the  parameters  in  these  various  components,  the 
M-step  of  the  EM  algorithm  decomposes  into  a  set  of  independent  optimization  prob¬ 
lems,  one  for  each  component.  The  assignment  probabilities  are  obtained  by  maximizing 
equation  (81),  and  the  parameters  in  each  mode  density  are  obtained  by  maximizing 
equation  (82). 


Estimation  of  Assignment  Probabilities.  Equation  (81)  is  maximized  using  La¬ 
grange  multiplier  techniques  to  obtain  the  estimator 


7 r,- 


%(e-) 

K 


(83) 
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where  n3  (©•)  is  the  effective  number  of  measurements  corresponding  to  mode  j  given 

by 


K 

«i(©*)  =  £We‘)-  (84) 

k= 1 

The  effective  numbers  of  measurements  for  all  modes  satisfy 

j 

*j(e'>  =  K-  (85) 

j= 1 


Gaussian  Mode  Estimation.  Each  component  in  equation  (82)  is  used  to  individ¬ 
ually  estimate  the  parameters  in  one  of  the  modal  distributions.  For  Gaussian  mode 
densities,  Pj(z\0j)  =  M{z\  jti7,Pj),  the  optimization  problem  decomposes  into  J  inde¬ 
pendent  (single- mode)  Gaussian  estimation  problems,  which  are  discussed  in  appendix 
D.  Drawing  on  the  results  of  that  appendix,  equation(82)  is  maximized  by 

1  K 

frj  =  (q*\  )  zfc>  (86) 

A  ’  k= 1 

1  A 

pj  =  X!  (z*  -  K-^)1  •  (8?) 

These  Gaussian  parameter  estimators  take  the  above  form  regardless  of  the  parametric 
structure  of  any  other  modes.  That  is,  not  all  modes  need  be  Gaussian. 


4.3  ESTIMATION  FROM  HISTOGRAM  DATA 

To  set  the  stage  for  histogram  mixture  estimation,  the  unconditional  probability 
of  any  given  histogram  cell  under  the  mixture  model  in  equation  (67)  is  given  by 


<I> 


(0)  =  f  p(z\  0)  =  X  ^  /  dz  pffz;  Of) . 

J  Zc  „• n  **  Zc 


(88) 


zi  j= 0  Jzt 

This  can  be  alternatively  expressed  by  defining  the  “mode-bin  probability” 


j{dj)  -  /  dz  Pj(z]  0j) , 
Jze 


(89) 


such  that  the  overall  bin  probability  is 


M®)  =  Y2'nJ  fajWj)- 

3=0 


(90) 
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The  “observed”  data  are  the  intensity  vector 


M  =  |  me :  i  =  1 . L  j  , 


(91) 


and  the  total  number  of  measurements  is 


L 


K  =  m  e . 


(92) 


As  discussed  in  section  3.3,  the  intensity  vector  may  contain  place-holders  for  expected 


intensities  in  the  truncated  region  of  a  truncated  histogram. 

4-3.1  EM  Algorithm  with  Missing  Assignments  and  Measurements 

Definition  of  Missing  Data  and  Auxiliary  Function.  For  histogram  estimation 
of  mixture-distribution  parameters,  the  missing  data  include  the  measurements  and 
mode  assignments,  which  are  denoted 


The  auxiliary  function  is  then  defined  as 

QJ  Z(0;0*,M)=  [  dZ  V  p(J,Z|M;©*)  logp(J,  Z,  M ;  0) ,  (93) 

Jz  j 

where  the  marginalization  operators  for  J  and  Z  were  defined  in  equations  (71)  and 
(11),  respectively,  and  are  given  for  the  current  case  as 


J  f=l  fc=l  f  jek=  o  J 

CDLF1  and  Posterior  Distribution.  The  CDLF  is  obtained  using  Bayes’  rule  as 
p(J.  Z,  M;0)  =  p{ M ;  0)  p( J.  Z  |  M ;  0) 


L  mi 


(94) 
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whose  logarithm  is 


L  mi 


logp(J.  Z.  M ;  0)  =  logc(M)  +  ££{  log  +  log Pjik  (zek  7  •  (9^) 


*=U  k=  1 


The  posterior  distribution  of  the  missing  data  is  then  obtained  by  dividing  equation 
(94)  by  the  definition  of  p(M  ;  0),  given  in  equation  (6),  which  yields 


mi 


p(J.  Z|  M:  ©*)  =  ] 


f=l  k= 1 


7ilkPm(zfh]eik) 

*/(©*) 


(96) 


Conditional  Expectation.  Given  equation  (96),  the  conditional  expectation  opera¬ 
tor  is  a  product  of  operators  given  by 


/<«£  P(J.Z|M:©-)  =  fl  f[  ( 

Jz  J  i=  1  k= i  l  n  '  j«.=0  */2‘ 


(97) 


Except  when  operating  on  some  function  containing  z each  term  in  the  large  brackets 
is  unity  since 

<we-)  =  y;  r1(k 

itk= o 

Therefore,  as  in  previous  cases,  when  the  conditional  expectation  operator  is  applied 
to  the  CDLF,  all  components  of  the  expectation  marginalize  to  one  except  those  for 
which  the  £  and  k  in  the  expectation  match  the  £  and  k  in  the  CDLF.  In  fact,  it  would 
be  more  correct  notationally  (but  much  uglier)  to  denote  the  indices  in  the  expectation 
operation  as  £'  and  k'  to  distinguish  them  from  the  £  and  k  that  appear  in  the  CDLF,  and 
then  introduce  a  Kronecker  delta  function  for  bookkeeping.  Whether  said  in  symbols 
or  words,  the  net  result  is  that  all  expectation  terms  marginalize  to  one  except  those 
for  which  £'  =  £  and  k'  =  k.  Noting  all  of  these  “unit  marginalizations”  and  dropping 
the  term  logr(M),  the  auxiliary  function  reduces  to 


I, 


dz«-  ft,.!  z<j.:  ) 


(98) 


L  mi 


Qjz(0:C-r,M)  =  2^  2^ 

t=\  k= 1 


*<(©*) 


£  *  / 

3tk= 0  JZ( 


dz(k  PJek{z(k-d*  ) 


X 


+  l°gPj<*  (z(k ;  0jtk 


At  this  point,  the  indices  £  and  k  on  the  missing  variables  j  and  z  impart  no  infor¬ 
mation  since  the  summation  over  the  assignment  variable  and  the  integration  over  the 
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measurement  covers  all  possible  values,  independent  of  i  or  k.  They  are  thus  dropped 
to  obtain 

L  mi  J 

Qj,z(©;  ©*,  M)  =  ^2  $  (Q*\  /  dz  p^z :  d^{  log7rJ  +  1(W(Z  5  °j)  }  • 

e=\  fc= i  A  '  j= o  Jzi 

Without  the  dependence  on  t  and  k ,  the  summation  over  j  can  be  moved  forward,  and 
the  sum  over  k  becomes  a  constant  multiplier  me,  allowing  the  auxiliary  function  to  be 
expressed  as 


Qj,z(©i©*,M)  =  ]r<3j,z(^;©-,M)  +  ]r  Qj,z(0,;0-,M), 


(99) 


3=0 


3=0 


where  the  components  in  this  expression  are  defined  as 

L  . 

Qj,z(Wi;©*,M)  =  7 Tj  Jz  dzPj(Z'e*j)  iog W|,  (100) 

L  - 

Qj,z(0j;©*,M)  =  t t*  jz  dzPj(z-dj)  log Pj(z-ej)-  (ll)1) 

With  the  exception  of  the  factor  n*,  the  form  of  this  last  component  is  identical  to 
the  auxiliary  function  Qz(0j-;©*,M)  for  the  non-mixture  case.  During  the  M-step, 
when  equation  (99)  is  differentiated  with  respect  to  0j  to  find  a  critical  point,  the 
derivative  is  n*  times  the  derivative  of  Qz(#/<  ©*•  M).  When  equated  to  zero,  the 
7r*  term  drops  out  and  optimization  of  ©*,  M)  is  achieved  by  independently 

optimizing  Qz(0j]  ©*,  M)  for  each  Oj,  making  all  of  the  earlier  (non-mixture)  results 
applicable.  That  said,  the  discussion  below  for  Gaussian  densities  includes  the  factor 
7T*  in  order  for  the  effective  number  of  measurements  to  fall  out  as  an  intermediate 
variable.  In  effect,  this  just  means  that  the  expressions  for  and  Pj  include  a  term 

*j/*j  =  L 

Estimation  of  Assignment  Probabilities.  Optimization  of  the  assignment  proba¬ 
bilities  is  performed  using  equation  (100)  directly.  A  convenient  form  for  that  expression 
is  obtained  by  noting  the  definition  of  <fv(0),  and  defining 


7<?(©*)  = 


Izt  dzPj(z’,0j) 


{£/=o7ri  Li  dzPj(zrfj)} 

With  these  so  defined,  the  expression  for  0*.  M)  then  becomes 


(102) 


Qj,z(^i©‘.M)  =  <  ^2  1  logTT,-. 


(103) 


£=1 
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Aside  from  the  weight  m t,  this  is  identical  in  form  to  the  case  with  observed  measure¬ 
ments.  The  EM  update  for  the  assignment  probability  is  thus  given  by 

1  L 

12  mt  (1U4) 

e=\ 

As  was  the  case  for  observed  measurements,  this  expression  is  independent  of  the  form 
of  the  modal  distributions. 


Gaussian  Mode  Estimation.  When  the  jth  mode  is  Gaussian,  the  corresponding 
component  of  equation  (101)  becomes 

L  r 

<?j,z(0j;0*,M)  =  7 Tj  Jz  log  A/' (z;  At,-,  P  j)  •  (105) 

Substituting  the  Gaussian  density  in  the  log-CDLF  term  of  the  auxiliary  function  gives 
Qz(8r,e-M)  =  |  log  Ip;1  |  £  ^  dzN(  z;M;,P-) 

T 

[  dz  N(z\  Mj'  Pj)  (z-Mj)1  PJ1  (z-Mj)  •  (100) 

J  Zf 


tE 


2  *,(©*)  JZt 

To  parallel  the  non-mixture  case,  it  is  convenient  to  define  the  weighted  “mode-specific’' 
local  moments 


«>0(«5)  =  [  dzM( p;)  z, 

Jzt 

(107) 

&ej{0j)  =  [  dz  jV(z;  fi*  P*)  zz1 

JZc 

(108) 

Also,  the  effective  number  of  measurements  corresponding  to  the  jth  mode  is 

=  ' 

(109) 

While  the  meaning  of  this  variable  is  the  same  as  in  section  4,  the  definition  is  altered 
relative  to  equation  (84)  to  accommodate  the  missing  measurements. 

Again  drawing  on  the  results  of  appendix  D,  the  estimators  for  the  mean  and 
covariance  of  the  jth  (Gaussian)  component  are 

1  L 

*  1  TTl£  *  f  M\ 

^  ~  Kj(e*)  fzt  *«(©*)  j)’ 

(110) 

Pl  =  Kj(e-)  t  4s-)  ’S'WMh 

(111) 
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whore  f 2^(0*,  =  JZ(  dz  N{z:  fx*.  P*)  (z  —  jx,j)(z  —  /j7)t  is  the  mode-specific  center- 

shifted  second  local  moment,  computed  as 

=  toej(0j)~  2Aj  +  n]  4>tj{Q*j)  •  (112) 

Estimating  a  Scalar  Gaussian  Signal  in  Uniform  Noise.  If  the  measurement 
space  is  scalar,  the  signal  distribution  contains  a  single  Gaussian  component,  and  the 
noise  is  considered  to  be  uniform  over  the  observable  region  of  measurement  space,  then 
the  model  distribution  is  the  two-component  mixture  given  by 

p{z-TTs,/u,a)  =  nsAf(z,ix,a2)  +  (1 -ns)  ,  (113) 

where  w  =  max(Zo)  —  min(Zo)  is  the  width  of  the  observable  region  and  of  the  uniform 
noise  distribution.  Note  that  the  above  expression  has  been  parameterized  in  terms 
of  the  mixing  weight  7rs  for  the  signal  component,  as  opposed  to  parameterizing  in 
terms  of  7To  =  1  —  ns  as  was  done  earlier  in  the  section  to  accommodate  mixture  signal 
distributions. 

One  nice  property  of  the  model  in  equation  (113)  is  that  the  noise  component 
contains  no  unknown  parameters.  The  unconditional  bin  probabilities  for  the  noise 
component  can  therefore  be  computed  once  at  the  outset  of  the  estimation  algorithm; 
they  need  not  be  updated  in  each  EM  iteration  since  nothing  changes.  Figures  5  and 
6  show’  EM  iteration  results  for  the  Gaussian-signal-in-uniform-noise  model  under  two 
scenarios.  In  generating  both  figures,  the  model  w’as  used  to  synthesize  a  number  of 
measurements,  these  measurements  were  binned  into  a  set  of  histogram  cells,  and  the 
histogram  intensities  were  used  to  estimate  the  model  parameters.  For  figure  5,  a 
very  large  number  of  synthetic  measurements  w’as  generated  and  binned  such  that  the 
histogram  has  a  large  overall  intensity  and  the  individual  intensities  are  quite  accurate. 
For  figure  6,  a  much  smaller  number  of  synthetic  measurements  was  generated  such  that 
the  histogram  has  a  small  overall  intensity  and  the  individual  intensities  are  noisy.  As 
one  might  expect,  the  final  EM  estimates  computed  from  the  noisy  histogram  data  are 
degraded  from  those  computed  from  the  accurate  histogram  intensities.  They  are  still 
quite  close  to  the  true  values,  however.  Furthermore,  as  figure  6  suggests,  the  estimate 
of  the  mean  (i.e.,  the  location  parameter)  is  far  superior  to  w’hat  would  be  obtained  if 
the  location  of  the  largest  histogram  connt  wore  chosen  (i.e.,  peak  picking). 


38 


Weighted  Mixture  Components 


Observed  Hit  Counts  and 
Approximate  Distribution 


Figure  5:  Gaussian- Uniform  Mixture  Estimation,  High- Intensity  Data 
Histogram- based  parameter  estimation  for  a  two-component  mixture  dis¬ 
tribution  with  Gaussian  and  uniform  components.  Histogram  data  were 
obtain  by  binning  K  =  106  measurements  that  were  generated  from  the 
ideal  mixture  distribution  with  parameters  p  —  4,  a  —  1,  and  ns  —  0.4.  The 
large  number  of  measurements  gives  “clean”  histogram  data,  and  the  es¬ 
timated  parameters  are  extremely  accurate  at  p  =  3.99937,  rr  =  1.00260,  and 
7rs  =  0.39925.  The  first  and  last  rows  correspond  to  the  initial  and  final 
parameter  estimates,  respectively.  The  second  and  third  rows  correspond 
to  intermediate  iterations  of  the  EM  algorithm. 
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Weighted  Mixture  Components 


Observed  Hit  Counts  and 
Approximate  Distribution 


-10  0  10 


-10  0  10 


Figure  6:  Gaussian- Uniform  Mixture  Estimation,  Low- Intensity  Data 
Histogram  estimation  of  parameters  in  the  two- component  mixture  dis¬ 
tribution  described  in  the  caption  of  Figure  5.  In  this  case,  however,  pa¬ 
rameters  are  estimated  from  “noisy”  histogram  data  with  K  =  200.  The 
final  parameter  estimates  are  p  =  4.03696,  a  =  1.18893,  and  ns  =  0.49568. 
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5.  SUMMARY 


This  report  has  examined  histogram  estimation  techniques  for  inferring  the  statis¬ 
tical  properties  of  physical  variables  from  observed  intensity  data.  The  motivation  for 
using  these  estimation  algorithms  is  to  reduce  bias  and  other  artifacts  that  can  occur 
when  traditional  peak-picking  or  simple  interpolation  algorithms  are  applied  to  intensity 
data,  particularly  when  the  data  exhibit  significant  spreading  in  the  variable  of  interest. 
The  report  opened  with  a  high-level  consideration  of  the  theory,  highlighting  some  of 
the  primary  characteristics,  issues,  capabilities,  and  limitations  of  the  theory  and  algo¬ 
rithms.  Great  emphasis  was  placed  on  the  fact  that  histogram  techniques  are  based  on 
discrete  stochastic  theory  (for  integer-valued  intensity  data)  and  the  issues  that  arise 
when  the  theory  and  techniques  are  applied  to  real-valued  acoustic  energy  data.  As 
it  turns  out,  this  does  not  present  a  significant  difficulty  except  in  Bayesian  contexts 
where  the  maximum  likelihood  (ML)  estimate  must  be  weighted  with  a  prior  distribu¬ 
tion.  The  relative  weighting  of  the  prior  and  ML  distributions  remains  a  significant 
open  issue  when  using  histogram  methods  with  real-valued  energy  data. 

After  the  introduction  and  consideration  of  issues  involved  with  application  of  the 
methods,  the  histogram  estimation  algorithms  were  developed  from  first  statistical  prin¬ 
ciples  by  drawing  on  significant  background  material  provided  in  a  set  of  appendixes. 
The  algorithms  were  “built  up”  by  considering  the  simplest  cases  first  and  then  adding 
complexity.  The  in-depth  tutorial  examination  given  here  provides  the  detailed  theoret¬ 
ical  developments  that  were  omitted  from  earlier  works  such  as  McLachlan  and  Jones 
[2].  This  report  also  limited  the  discussion  to  static  distributions,  which  are  subtle 
and  interesting  enough  to  warrant  consideration  in  and  of  themselves.  Limiting  the 
discussion  to  the  static  case  also  avoids  the  significant  unresolved  issues  and  notational 
baggage  that  is  incurred  when  histogram  techniques  are  considered  in  the  context  of 
dynamic  tracking  algorithms  (e.g.,  see  [3]  and  [5]).  The  overall  goal  of  this  report  was 
to  arm  the  reader  with  adequate  background  and  insights  to  apply  and  extend  these 
powerful  and  versatile  methods  for  a  variety  of  applications. 
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APPENDIX  A 

EXPECTATION-MAXIMIZATION  ALGORITHMS 


The  goal  of  this  appendix  is  to  introduce  the  basic  ideas,  terminology,  and  no¬ 
tation  for  the  expectation-maximization  (EM)  approach  to  algorithm  design,  which  is 
the  common  thread  through  all  of  the  algorithms  discussed  in  this  report.  The  EM 
approach  provides  a  general  template  for  generating  iterative  maximum-likelihood  al¬ 
gorithms  that  estimate  the  parameter  vector  ©  in  a  probability  density  function  (PDF) 
p(x;  ©),  given  observations  of  the  variable  random  variable  x.  The  EM  approach  is 
most  useful  when  it  is  difficult  to  directly  maximize  p(x:  0)  with  respect  to  0,  but 
there  exists  an  auxiliary  variable  £,  such  that  maximizing  the  joint  PDF  p(x,  £;  0)  is 
straightforward.  The  approach  replaces  a  difficult  nonlinear  optimization  problem  with 
an  iterative  sequence  of  easier  problems.  The  EM  method  is  extensively  used  in  modern 
statistical  analysis  because  it  greatly  simplifies  algorithm  development  for  many  prol>- 
lems,  it  has  guaranteed  convergence  under  some  fairly  general  conditions,  and  many 
statistical  models  have  obvious  choices  for  missing  data.  The  general  properties  of 
the  EM  algorithm  are  discussed  in  [1],  and  numerous  applications  and  extensions  are 
discussed  in  the  text  by  McLachlan  and  Krishnan  [10]. 

Observed,  Missing,  and  Complete  Data.  EM  algorithms  all  share  the  charac¬ 
teristic  of  having  observed,  missing,  and  complete  data.  The  observed  data  consist  of 
samples  of  data  that  are  at  hand,  representing  either  physical  measurements  from  a 
sensor  or  features  that  are  computed  from  physical  measurements.  The  collection  of 
such  samples  is  denoted  X  =  {xn  :  n  =  1, . . . ,  N}.  The  density  function  p(X;  0)  is 
referred  to  as  is  the  observed-data  likelihood  function  (ODLF). 

The  missing  data  contain  the  information  that,  were  it  known  in  addition  to  the 
observed  data,  causes  the  difficult  estimation  problem  to  reduce  to  a  simpler  one.  For 
a  given  set  of  observed  data  X,  the  corresponding  collection  of  missing  data  is  denoted 
H  =  (£m  :  m  =  1.....A/}.  The  concatenated  set  containing  both  the  observed  and 
missing  data  is  referred  to  as  the  complete  data ,  and  the  joint  distribution  p(X,  S;  0)  is 
referred  to  as  the  complete-data  likelihood  function  (CDLF).  As  mentioned  above,  the 
fundamental  premise  of  EM  is  that  p(X,  S;  0)  is  much  easier  to  optimize  with  respect 
to  0  than  is  p(X:  ©). 

Iterative  Structure  and  Auxiliary  Function.  The  EM  algorithm  is  iterative.  It 
requires  an  initial  estimate  for  0,  which  is  obtained  using  an  application-dependent  sub- 
optimal  algorithm  (the  preferred  method),  or  by  random  selection  within  some  reason¬ 
able  range  of  values.  Denoting  the  initial  estimate  by  0°,  the  EM  algorithm  generates 
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the  sequence  of  estimates  defined  by 

©I+1  =  arg max  Qs(©;  ©\ X) ,  (114) 

where  Q =(©;©',  X)  is  the  so-called  auxiliary  function ,  which  will  be  formally  defined 
in  a  moment.  While  the  literature  typically  denotes  the  auxiliary  function  simply  as 
Q(0;©*),  this  report  discusses  a  number  of  different  auxiliary  functions  with  differ¬ 
ent  sets  of  observed  and  missing  data.  To  more  easily  distinguish  these  various  cases, 
the  notation  used  here  includes  a  subscript  to  denote  the  missing  data  and  an  addi¬ 
tional  argument  to  denote  the  observed  data;  thus  the  subscript  S  and  argument  X  in 
Q3(©;©*,X). 

WThen  iterating  equation  (114)  to  optimize  the  ODLF,  the  iterations  are  termi¬ 
nated  either  at  a  pre-selected  number  of  iterations  or  when  some  set  of  convergence 
criteria  is  satisfied.  When  convergence  tests  are  used,  they  axe  typically  based  on  the 
relative  change  in  the  observed-data  likelihood  and/or  values  of  the  estimates,  similar  to 
convergence  tests  for  standard  nonlinear  optimization  techniques  like  the  Newton  and 
gradient  ascent  algorithms. 

Equation  (114)  includes  the  iteration  number  as  an  index.  When  using  the  EM 
approach,  there  are  often  a  large  number  of  other  indices  that  track  a  number  of  other 
characteristics,  making  index  variables  a  rare  notational  commodity.  It  is  therefore 
convenient  to  define  the  parameter  estimates  as  ©  =  ©,+1  and  ©*  =  ©',  such  that  the 
“typical”  EM  iteration  is 

0  =  arg  max  Q =(©:  ©*,  X) .  (115) 

0 

The  auxiliary  function  Qs(0;0*,X)  is  defined  as  the  expectation  of  the  log  of 
the  CDLF,  conditioned  on  the  posterior  distribution  of  the  missing  data,  which  is  stated 
mathematically  as 

Qh(©;©*,X)  =  E'six  ©*  {  log  p(X.  S;  0)  j 

=  J  da  p(S|  X,  ©*)  log  p(X,  S;  ©) .  (116) 

Here  the  integral  notation  J  d  S  is  used  to  indicate  marginalization  over  the  missing 
data  S;  in  general,  this  is  a  sequence  of  continuous  integrations,  discrete  summations, 
or  both.  Since  missing  data  with  both  continuous  and  discrete  variables  are  common, 
sequences  that  mix  integrations  and  summations  are  also  common. 

The  basic  idea  behind  EM  is  well  illustrated  from  the  viewpoint  of  iterative  mi- 
norization  (IM),  which  is  an  even  more  general  class  of  algorithms  to  which  the  EM 
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method  belongs.  Figure  A-l  shows  two  iterations  of  a  generic  IM  algorithm.  As  dis¬ 
cussed  in  the  figure  caption,  each  iteration  maximizes  an  auxiliary  function  that  “mi- 
norizes"  the  function  being  maximized,  which  for  the  EM  algorithm  is  the  ODLF.  That 
the  EM  auxiliary  function  minorizes  the  ODLF  follows  from  the  nature  of  missing  data, 
in  the  sense  that  including  a  stochastic  nuisance  parameter  in  a  model  always  drives 
down  the  likelihood  relative  to  the  nuisance-free  model  because  of  the  added  uncertainty 
associated  with  the  nuisance  parameter. 

Expectation  and  Maximization  Steps.  Each  EM  iteration  consists  of  an  ex¬ 
pectation  step  (E-step)  and  a  maximization  step  (M-step).  The  El-step  involves  eval¬ 
uating  equation  (116),  and  the  M-step  involves  computing  estimates  that  maximize 
equation  (116)  with  respect  to  0.  The  process  for  deriving  the  auxiliary  function  in 
the  E-step  is  fairly  universal  across  applications  and  is  largely  an  exercise  in  conditional 
probability.  The  steps  are  to  (1)  define  the  observed  data  and  the  ODLF,  (2)  define 
the  missing  data  and  the  CDLF,  (3)  determine  the  posterior  distribution  of  the  missing 
data,  and  (4)  carry  out  the  conditional  expectation  to  obtain  the  auxiliary  function.  In 
many  applications,  the  missing  data  are  chosen  such  that  the  conditional  distribution 
/)(X|  S;  0)  can  be  written  in  closed  form,  and  there  exists  a  known  unconditional  (prior) 
distribution  p( E:0).  In  such  cases,  the  CDLF  is  obtained  as 

P(X,  3;0)  =  p(X|S;©)p(S;©).  (117) 


Given  the  ODLF  and  CDLF\  the  posterior  for  the  missing  data  is  given  by 


P(S|X;©) 


p(X,  5;  0) 
P(X;  ©) 


(118) 


In  other  applications,  it  may  be  more  convenient  to  first  derive  the  posterior  distribution 
p(S|X:@)  and  then  obtain  the  CDLF  as 


p(X,  3;  0)  =  p(a |  X;  0)  p(X;  0) .  (119) 


The  preferred  order  for  evaluating  these  distributions  depends  on  the  structure  of  the 
missing  data  and  the  distributions  involved,  but  some  form  of  these  steps  will  be  en¬ 
countered  whenever  EM  is  applied  in  a  new  situation.  The  final  stage  in  deriving  the 
auxiliary  function  involves  carrying  out  the  conditional  expectation,  which  is  usually 
simplified  by  noting  independence  or  conditional  independence  among  the  variables. 
More  will  be  said  about  this  process  in  the  context  of  particular  problems. 

While  the  details  of  the  M-step  cannot  be  specified  without  formulating  a  partic¬ 
ular  parameterization  of  the  CDLP\  these  details  typically  follow'  from  standard  con- 
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Figure  A- 1:  Iterative  Minorization  (IM) 

Two  IM  iterations  are  shown;  the  first  depicted  in  red  and  the  second  in 
green.  The  function  L(0)  to  be  maximized  is  shown  in  black.  The  first 
iteration  maximizes  an  auxiliary  function  Q(6*\0’ _1),  which  “ minorizes ” 
the  likelihood  function  at  O' _1;  that  is,  Q{0^\0e~l)  is  strictly  less  than  L(0) 
at  every  point  0  except  6  ~l ,  where  the  two  functions  share  the  same 
function  value  and  first  derivative  (i.e.,  the  two  functions  are  tangent  at 
Oe~x).  Under  these  minorization  constraints,  maximization  of  Q(8e\0(~1) 
is  guaranteed  to  increase  L{0)  unless  Of~l  is  already  a  stationary  point  of 
L{0),  in  which  case  no  changes  take  place.  The  second  iteration  repeats 
the  process  with  the  tangent  constraint  imposed  at  the  point  0(  produced 
by  the  first  iteration. 
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strained  or  unconstrained  optimization  methods.  For  example,  a  stationary  point  is  ob¬ 
tained  by  equating  to  zero  the  derivative  of  Qs(&:  0*.  X)  with  respect  to  each  element 
of  0.  The  stationary  point  is  a  maximum  if  the  second  derivative  (Hessian)  matrix  is 
negative  definite  (i.e.,  all  of  its  eigenvalues  are  strictly  negative)  at  the  stationary  point. 
Fortunately,  the  auxiliary  function  is  concave  for  many  problem  formulations  involving 
Gaussian  or  Gaussian-mixture  model  densities,  which  means  that  a  stationary  point  for 
such  an  objective  function  is  the  unique  maximum.  Appendix  D  demonstrates  that  the 
auxiliary  function  for  Gaussian  histogram  estimators  is,  indeed,  concave. 
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APPENDIX  B 

COMBINATORIAL  PROBABILITY  DISTRIBUTIONS 


Histogram-based  methods  are  inherently  combinatorial,  since  they  consist  of  look¬ 
ing  at  all  combinations  of  ways  that  events  can  be  grouped  into  a  collection  of  histogram 
cells.  This  appendix  reviews  the  binomial,  negative  binomial,  and  multinomial  distri¬ 
butions  to  establish  notation  and  to  give  an  easy  single  reference  for  these  basic  distri¬ 
butions. 


Binomial  Distribution.  Perhaps  the  simplest  experimental  situation  involves  a 
series  of  Bernoulli  trials,  whose  outcomes  are  limited  to  one  of  two  categories,  “success” 
or  “failure”.  A  common  question  with  regard  to  Bernoulli  trials  concerns  the  number  of 
successes  that  might  be  observed  over  the  course  of  n  independent  trials.  This  situation 
is  described  by  the  binomial  distribution.  Let  the  probability  of  success  in  any  single 
trial  be  denoted  by  0,  such  that  the  probability  of  failure  is  (1  —  0).  To  have  exactly  m 
successes  in  n  independent  trials,  there  must  also  be  (n  —  m)  failures.  The  probability 
of  the  m  successes  is  0m;  the  probability  of  the  (n  —  m)  failures  is  (1  —  0)(n-"d;  and 
there  are  b(m,  n)  ways  in  which  this  situation  can  occur  in  n  trials,  where  ft(m,  n)  is  the 
binomial  coefficient: 


6(m,  n) 


n ! 


m\(n  —  m) ! 


(120) 


The  binomial  distribution  measures  the  probability  of  the  event  “m  successes  out  of  n 
trials.”  The  distribution  is  therefore  defined  as 


p(m ;  0,  n)  =  b(m ,  n)  0m  (1  —  0)^n  ,n) , 


(121) 


which  satisfies  the  marginalization  constraint 

n 

^2  p(m;0,n)  =  1.  (122) 

m= 0 


Negative  Binomial  Distribution.  In  the  binomial  distribution,  the  number  of  trials 
is  a  parameter,  and  number  of  successes  is  a  random  variable.  Suppose,  however,  that 
the  question  is  turned  around  to  ask  “How  many  trials  are  needed  to  achieve  a  certain 
number  of  successes?”  Here,  the  number  of  successes  is  the  parameter,  and  the  number 
of  trials  is  the  random  variable.  This  situation  is  described  by  the  negative  binomial 
distribution 


p(n ;  0,  m)  =  b  (n,  m)  0m  (1  -  0)(n  m) , 


(123) 
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where  b  (n,  m)  is  the  negative  binomial  coefficient ,  defined  as 

(n~  1)! 


b  ( m,n )  =  b(n  —  m  —  1,  n  —  1)  = 


(m  —  1) !  (n  —  m) ! ' 

The  negative  binomial  distribution  satisfies  the  marginalization  constraint 


(124) 


53  <P,m)  =  1. 


(125) 


Multinomial  Distribution.  Now  consider  an  independent  series  of  trials  in  which 
each  trial  has  L  categories  of  possible  outcomes,  rather  than  simply  success  or  fail¬ 
ure.  With  no  other  information,  the  probability  of  an  outcome  occuring  in  the  £th 
category  is  denoted  <pe,  and  the  collection  of  probabilities  for  all  categories  is  <p  = 
{(j)e  '■  (■  —  1 , . . . ,  L}.  When  actually  performing  a  series  of  trials,  an  observed  outcome  in 
one  of  the  categories  is  referred  to  as  a  "hit”  in  that  category.  The  number  of  hits  ob¬ 
served  in  each  category  over  the  series  of  trials  is  denoted  mt  for  i  =  1, . . . ,  L.  These  “hit 
counts”  are  collected  in  the  vector  M  =  {me  :  l  =  1, . . . ,  L},  whose  statistical  properties 
are  governed  by  the  multinomial  distribution 


L 

P(M;0)  =  c(M)  II  C- 

e=i 

where  c(M)  is  the  multinomial  coefficient ,  defined  as 


c(M) 


(st.  ! 

{rtf.,"*!}' 


(126) 


(127) 


Unlike  the  binomial  distribution,  the  number  of  trials  is  not  an  explicit  parameter 
because  the  multinomial  distribution  implicitly  assumes  that  the  sum  of  the  hit  counts 
for  all  categories  equals  the  number  of  trials  (i.e.,  that  all  outcomes  are  counted).  If 
this  is  not  the  case  in  a  given  application,  then  the  multinomial  distribution  is  not  the 
correct  model. 

Equation  (126)  assumes  that  any  outcome  not  fitting  into  one  of  the  defined 
categories  has  zero  probability  of  occurrence.  Alternatively  stated,  it  assumes  that, 
with  probability  one,  all  possible  outcomes  fall  within  one  of  the  categories,  such  that 

L 

'52<h  =  l.  (128) 

If,  on  the  other  hand,  events  occur  with  nonzero  probability  that  do  not  fit  any  of  the 
categories,  then  the  statistical  model  must  be  modified.  In  particular,  consider  the  case 
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of  “pre-screened”  data,  where  any  trial  whose  outcome  does  not  fit  a  given  category 
is  ignored  entirely;  not  only  is  the  hit  not  recorded,  but  the  counter  that  records  the 
total  number  of  trials  is  not  incremented.  The  sum  of  hit  counts  equals  the  number  of 
recorded  trials,  so  the  general  form  of  the  multinomial  distribution  is  still  valid.  The 
probabilities  for  the  categories,  however,  must  be  normalized  to  account  for  the  fact  that 
the  universe  of  outcomes  has  been  projected  down  to  the  union  of  the  known  categories. 
Defining  the  probability  that  a  hit  occurs  in  any  of  the  categories  as 

L 

4>  =  Y,  (129) 

allows  the  multinomial  distribution  for  this  case  to  be  defined  as 

p(M  ;0)  =  c(  M)  n  (f)‘  -  c(M)0-m  nc.  (130) 

i  W '  t=i 

where  m  is  the  total  number  of  trials  that  pass  the  screening  process;  that  is, 

L 

m  =  E  W(.  (131) 

#=i 

Negative  Multinomial  Distribution.  When  applying  the  multinomial  distribution 
to  pre-screened  data,  the  distribution  is  useful  for  making  inferences  only  about  things 
that  occur  within  the  observable  categories.  It  is  sometimes  desirable  to  extrapolate 
inference  into  categories  that  cannot  be  observed.  This  situation  can  be  thought  of  in 
terms  of  Bernoulli  trials  by  grouping  the  observable  categories  into  one  single  super¬ 
category.  The  universe  of  possible  outcomes  can  then  be  partitioned  into  two  classes, 
with  a  measurement  either  falling  within  the  observed  super-category  (a  “success”) 
or  falling  outside  of  this  super-category  (a  “failure”).  Treating  the  sum  of  observed 
hit  counts  as  the  number  of  successes  in  a  sequence  of  Bernoulli  trials,  the  unknown 
total  number  of  trials  required  to  have  generated  these  successes  is  a  random  variable 
governed  by  the  negative  binomial  distribution  in  equation  (123),  with  (p  defined  by 
equation  (129)  and  m  given  by  equation  (131).  This  analysis  can  be  taken  a  step  further 
by  subdividing  the  space  of  unobservable  outcomes  and  making  inferences  about  these 
unobservable  categories.  This  last  situation  is  described  by  the  negative  multinomial 
distribution,  which  is  discussed  further  in  appendix  C. 
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APPENDIX  C 

STATISTICS  OF  MISSING  HIT  COUNTS 


This  appendix  derives  the  posterior  distribution  p(Mt|Mo;0)  of  the  missing  in¬ 
tensities  given  the  observed  histogram  eounts,  and  then  uses  this  posterior  to  obtain 
the  expected  intensities  in  the  unobserved  histogram  cells. 

Posterior  Distribution.  Development  of  the  posterior  density  involves  making  a 
number  of  structural  observations  concerning  the  component  distributions  involved. 
This  discussion  is  facilitated  by  defining  the  “number  of  truncated  measurements”  as 
the  discrete  random  variable 

L 

At  =  Tne.  (132) 

e=L0+ 1 

The  total  number  of  (observed  and  unobserved)  measurements  is  then  the  discrete 
random  variable  defined  by 


Ac  =  Ao  +  At  •  (133) 

Observing  that  the  regions  Zo  and  Zj  are  disjoint  in  Z.  the  posterior  distribution 
satisfies 


p(MT|Mo;  0)  —  p(Mx  |  K0 ;  0)  ■  (134) 

That  is,  from  the  standpoint  of  Z\.  it  does  not  matter  how  the  measurements  in  Zq 
are  distributed  within  Zo\  only  the  total  number  of  hits  in  Z0  is  important  since 
that  provides  evidence  for  inferring  the  total  number  of  truncated  measurements.  The 
second  structural  observation  concerns  the  appearance  of  these  hit-count  totals  in  the 
distribution  functions.  For  example,  note  that  in  the  presence  of  Mt,  the  variable  At 
carries  no  additional  statistical  information,  such  that 

p(MT  |  Ao  ;  6)  =  p(MT,  A'tI  A'o  ;0)  S  ( AT  —  ^  me  j  , 

\  (=Lo  +  i  / 

where  <5(-)  is  the  Kronecker  delta  function,  whose  value  is  one  for  an  argument  of  zero 
and  zero  otherwise.  The  delta  function  is  introduced  to  ensure  that  the  distribution  has 
nonzero  probability  only  when  equation  (132)  is  satisfied.  Having  said  this,  this  delta 
function  will  be  dropped  to  ease  the  notational  burden,  with  the  understanding  that 
equation  (132)  is  indeed  satisfied.  A  similar  observation  can  be  made  concerning  Kc 
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and  At-  That  is,  given  the  observed  hit  total  A'o,  the  variables  A'c  and  At  carry  the 
same  information,  such  that 

p(KT  |  A'o  ;  6)  =  p(Kc  |  Ao  ;  0)  5(KC  -  A'o  -  A'x) .  (135) 

As  before,  the  Kronecker  delta  is  dropped,  with  the  understanding  that  equation  (133) 
is  satisfied.  Given  these  observations,  the  posterior  distribution  for  the  missing  data  is 
given  by 


p{ Mt  |  A0  ;  0)  =  p(Mt,  A't  |  A'o  0) 

=  p( Mt  |  At,  K0 ;  0)  p(KT  \  Kq  ;  0) 

=  p(Mt\Kt,0)P(Kc\I<o]0),  (136) 


where  the  first  term  in  the  last  expression  follows  because  once  At  is  given,  A'o  does 
not  bring  any  additional  information  concerning  Mt.  The  variable  Mt  merely  splits 
the  A't  measurements  among  the  Ax  bins  in  Zi,  independent  of  what  happens  in  Zo- 
The  conditional  distribution  p(Mt  |  A't  ;  0)  thus  corresponds  to  A't  independent  draws 
from  At  categories,  and  is  given  by  the  multinomial  distribution 

-Kt  L  m( 

p(Mt|A't;0)  =  c(Mx)  {<f>T(0)j~  I]  {MO)}  ■  (137) 

£=Lo  +  l 

The  distribution  p(Kc  \  A'o  ;  0)  pertains  to  the  total  number  of  Bernoulli  trials  A'c  that 
must  be  executed  in  order  to  achieve  A'o  successes,  where  a  “success”  is  a  hit  anywhere  in 
Zej.  The  total  number  of  hits  is  therefore  governed  by  the  negative  binomial  distribution 
in  equation  (123).  Noting  the  relationships  between  A'o,  A't,  and  A'c,  and  those  between 
(p o{0)  and  4> t(0),  the  negative  binomial  distribution  for  this  situation  is  expressed  as 

p{Kc\Ko\0)  =  (A'c,  A'o)  {  <j>o(0)  }A°  {<^(0)}^,  (138) 

where  6“ (A'c,  A'o)  is  the  negative  binomial  coefficient  defined  in  equation  (124).  Sub¬ 
stituting  equation  (137)  and  equation  (138)  into  equation  (136)  then  gives  the  posterior 
distribution  of  the  missing  intensities  as  a  negative  multinomial  distribution,  which  is 
defined  for  the  present  situation  as 


(139) 


p(Mr\Ko\0)  =  c~(Mt,Kq)  {^o(0)}A°  n  {®f(9;}”'. 


Lo+t 
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where  c  (Mt,  Ko)  is  the  negative  multinomial  coefficient ,  defined  as 

(Kc  —  1) ! 


c~( Mt.A'o)  =  c(Mt)  b  (Kc,  Ko)  = 


(A'o-l)!  {ntio+1  m'!} 


(140) 


Expected  Values.  The  expected  intensities  are  obtained  by  computing  the  first  mo¬ 
ment  of  the  posterior  distribution.  Dropping  the  explicit  dependence  on  the  parameter 
vector  9  and  expressing  the  negative  multinomial  coefficient  directly  in  terms  of  the 
missing  intensities  instead  of  total  hit  counts  allows  the  posterior  to  be  written  as 


p(MT\Ko)  = 


+E,'=,.0|1 


<>z°  n  tf"- 


(141) 


(Ko  -  1) !  {nt.io+1  m> !}  '='.0+1 

The  expected  value  of  the  (typical)  missing  hit  count,  m q  is  defined  for  L0  <  (  <  L  as 

E{ roc | Mo}  =  m<  p(Mt|Ao),  (142) 

where  the  marginalization  operator  was  defined  in  equation  (39),  which  is  repeated  here 
for  convenience  as 


l  f  x  1 

E  =  II  E  • 

Mt  +  1  L  tt? ^ — 0  ) 


(143) 


The  key  piece  of  information  for  taking  the  expectation  is  the  normalization  property 
of  the  negative  multinomial  distribution,  that  is, 


X>(Mt|*o)  =  1. 
Mt 


(144) 


Forming  and  simplifying  the  product  p(Mx  |  Ko)  is  greatly  facilitated  by  defin¬ 
ing  the  variable 


mt  =  < 


ni(  if  £  £ 

mt  —  \  if  C  =  £ . 

With  this  variable  so  defined,  the  product  m^p(Mr|  Ko)  can  be  written  as 


(145) 


(Ko  —  1  +  +1  )  •  ' 

mcP(MT|/fo)  =  4 - ,  \  oz°  n 

(A'o  -  1)!  {nEo+1  ™<!}  '='.,11 


(146) 


The  desire  is  to  obtain  an  expression  that  is  equal  to  within  a  scale  factor  of  the  nega¬ 
tive  multinomial  distribution  with  the  variable  me  instead  of  me.  If  such  an  expression 
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can  be  obtained,  then  the  negative  multinomial  portion  marginalizes  to  unity  and  the 
desired  expectation  is  the  scale  factor.  To  get  the  numerator  of  the  negative  multino¬ 
mial  coefficient  in  terms  of  me  while  retaining  the  desired  structure,  it  is  required  to 
compensate  by  introducing  the  variable 

Aq  =  Kq  +  1  (147) 


such  that  the  numerator  is  given  by 


L 

>  +  E 

£=L0  +  1 


!  = 


(148) 


Now,  the  marginalization  property  of  the  negative  multinomial  distribution  is  valid 
regardless  of  the  actual  value  of  A'o,  so  long  as  the  structure  is  retained  and  the  same 
value  appears  everywhere.  The  new  observed  count  variable  is  thus  substituted  and 
compensated  everywhere  it  appears  in  equation  (146)  to  obtain 


mcp(MT|  A'o) 


(Ao-1+EE°+1^)!  £  *  ^ 

{nf-to+i  '«<’}  4,0  t,L o+i 


(149) 


Substituting  and  compensating  the  variable  me  in  the  final  product  term  and  rearranging 
then  yields 


mcp(Mr|  A'o) 


A', 


o 


<t>  o 


( Ko  -  1  +  He=Lo+ 1  m*)  ! 

(Kq  —  1) !  {ni=£,0+i 


4°  ri 


(150) 


Since  the  term  in  brackets  is  exactly  the  negative  multinomial  distribution  in  the  vari¬ 
ables  rhe  and  A'o,  and  the  scale  factor  is  independent  of  any  of  the  truncated  me,  the 
bracketed  term  marginalizes  under  the  expectation,  leaving  the  desired  expression 
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APPENDIX  D 

ESTIMATING  GAUSSIAN  PARAMETERS 


This  appendix  derives  estimates  and  proves  the  maximality  of  those  estimates  for 
parameters  (jl  and  T  in  the  objective  function 


Q(0) 


=  Y'  pt  [  dzM( z;  <,S)  log  .5/1*:  n,T)  , 

(=i  Jz> 


(152) 


where  pp  is  a  non-negative  weighting  function,  M{z\C,,S)  is  the  observed  measurement 
density 


J\f( z;  S)  =  I  27T  S  I  12  exp 


-2(z-0 


Ts-' 


(z  -  C) 


(153) 


and  logA/^z:  p,,T)  is  the  model  likelihood  function  (i.e. ,  the  log  of  the  model  density 
function).  Omitting  the  constant  log(27r)  term,  which  does  not  affect  the  parameter 
estimation  problem,  the  model  likelihood  is 


logAA(z;  T)  =  ~  log 


^  (z  -  aO1  r(z  -  p) , 


(154) 


which  is  parameterized  here  in  terms  of  the  inverse  covariance  matrix  (or  information 
matrix )  I  =  P1  to  ease  the  math.  The  parameter  estimates  that  maximize  the  ob¬ 
jective  function  are  identified  below.  The  integrations  are  carried  out  first  to  obtain  an 
expression  that  is  cast  in  terms  of  a  set  of  effective  measurements.  The  resulting  expres¬ 
sion  is  then  maximized  by  finding  the  stationary  point  of  the  function  and  by  proving 
that  the  function  is  concave,  such  that  the  stationary  point  is  the  unique  maximum. 


Effective  Measurements.  Given  the  form  of  the  model  likelihood  function,  it  is  con¬ 
venient  to  similarly  decompose  the  objective  function  into  determinant  and  quadratic- 
form  components,  giving 


QW  = 


(155) 


where 


1  *7-  f 

m(0)  =  2  2Z  pl  /  dzN{z:  C.S)  log 

£=l  J 
i  £  r 

m{0)  =  o  ^2  pe  /  dzA/"(z;CS)  (z-//)Tr(z  -fj.) 

D _ 1  J 


^=1 


(156) 

(157) 
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The  determinant  component  is  rewritten  by  defining  the  bin  probability  value 


(158) 


and  the  effective  number  of  measurements 


to  obtain 


L 

*  =  yi  P(  4>e 
e=i 


!),  (0)  =  -  log 


(159) 


(160) 


Equation  (157)  is  rewritten  by  expressing  the  quadratic  form  as  the  trace  of  an  outer 
product,  which  gives 


m{0)  = 


\  Y!  pl  f  dz-^(z'  CS)  tr{T(zzf  —  2z /x1  +/X/X1)} 

z  e=1  JZi 

=  ^  ^  tr  {  r  f  dzAf(z\C,S)  (zz1  -2Z/X1  +/x/iT)| 

l  L 

-  P*  fa  tr  {  r  (tie  -  +  /x/i1 )  } 


e=i 

L 


=  \  pe  fa tr  { r  [^( + (w<  - 1*)  (w< -  ^)T] }  > 


(161) 

(162) 


where 


UJ(  =  ~  [  dz  A7(z;  C,  S)  z, 


=  —  [  dz  A7(z;  <.  S)  zzr, 

Ve  Jze 


Of  =  Of  —  up  w/  . 


(163) 


(164) 

(165) 


Vector  Wf  is  the  normalized  first  moment  (centroid)  and  matrix  Of  is  the  normalized 
second  moment  of  A/"(z;  £ .  S)  when  restricted  to  the  region  Zt.  Vector  u>e  is  also  referred 
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to  as  an  “effective  measurement”  for  reasons  that  will  be  made  clear  in  a  moment. 
Matrix  is  the  second  central  moment  (covariance  matrix)  of  Af(z;  C  S)  in  Z(. 

An  interesting  form  of  the  objective  function  is  obtained  by  defining  the  weighted 
sum  of  covariance  matrices 

1  L 

n  =  2  ll  pt  **  (ice) 

£=l 

such  that,  when  7] 2(B)  is  re-combined  with  the  determinant  term  r; \(&),  the  overall 
objective  function  becomes 

L 

Q{9)  =  tr  {rf2  j  +  ^  p(  (pt  log  Af(ut\fM,T)  .  (167) 

e=i 

Aside  from  the  trace  term,  which  reflects  the  cumulative  measurement  uncertainty 
within  all  bins,  the  objective  function  is  equivalent  to  one  in  which  the  effective  measure¬ 
ments  u)(  are  observed  point  measurements.  Indeed,  from  the  standpoint  of  estimating 
the  mean,  this  problem  is  identical  to  one  in  which  the  u)(  are  observed  point  measure¬ 
ments  since  the  trace  term  is  not  a  function  of  //. 

Location  of  the  Stationary  Point.  The  stationary  point  of  the  objective  function  is 
found  by  equating  to  zero  its  partial  derivatives  with  respect  to  the  various  parameters. 
Noting  equation  (161)  and  the  identity 

r\ 

—  tr{Axx1}=2Ax.  ( 168) 

ax  1  J 

the  derivative  of  the  objective  function  with  respect  to  the  mean  is  given  by 


dQ(G)  dm(0)  ^ 

sjr  =  =  rlr  p"t’1 

Equating  to  zero  gives  the  stationary  value  for  the  mean  as 


(170) 


The  information  matrix  is  estimated  by  differentiating  Q{0)  with  respect  to  T, 
equating  the  result  to  zero,  and  substituting  in  place  of  fx  the  stationary  value  jx. 
Expressing  the  objective  function  as  the  sum  of  equations  (160)  and  (162),  and  drawing 
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upon  the  derivative  identities 


log  A  =  A  1 


(171) 


and 


(172) 


(173) 


gives  the  desired  derivative  as 


Equating  to  zero  and  substituting  fj,  then  gives  the  stationary  value 


p  =  r 


Maximality  of  the  Stationary  Point.  A  stationary  point  is  a  local  maximum  of 
the  objective  function  if  the  Hessian  matrix  of  the  objective  function  is  negative  definite 
at  the  stationary  point.  Construction  of 'the  Hessian  matrix,  however,  requires  taking 
second  derivatives  of  the  function  with  respect  to  all  possible  pairwise  combinations  of 
the  elements  of  the  mean  vector  and  information  matrix,  which  can  be  a  formidable  task. 
If  the  objective  function  is  radially  concave,  however,  then  it  has  the  nice  characteristic 
that  there  is  only  one  stationary  point  and  it  is  the  unique  global  maximum.  The 
remainder  of  this  section  parallels  a  proof  by  Liporace  [11]  of  a  test  for  radial  concavity, 
which  does  not  require  forming  all  of  the  second  partial  derivatives.1  Now,  if  the  function 
fails  the  concavity  test,  a  critical  point  may  still  be  a  local  maximum,  so  the  test  is  not 
conclusive  in  that  regard.  If,  on  the  other  hand,  the  function  passes  the  test,  then  no 
further  analysis  is  needed. 

The  basic  idea  behind  the  test  for  radial  concavity,  which  is  illustrated  in  fig¬ 
ure  D-l,  is  to  formulate  a  family  of  one-dimensional  trajectories  along  the  function 
that  cover  every  possible  direction  from  the  stationary  point,  and  then  show  that  these 
one-dimensional  trajectories  are  all  individually  concave.  If  the  objective  function  is 
continuous  at  the  stationary  point,  and  the  “hill  curves  downward”  in  every  direction 

indeed  the  test  outlined  below  would  be  a  test  of  absolute  concavity  were  it  not  for  the  pathological  family  of 
functions  in  which  non-concave  behavior  can  occur  along  contour  lines  that  are  perpendicular  to  radial  lines  emanating 
from  the  critical  point,  which  is  the  motivation  for  the  term  radially  concave.  Even  in  this  pathological  case,  however, 
the  stationary  point  is  still  the  unique  maximum. 
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Figure  D-l:  A  Test  for  Radial  Concavity. 

The  surface  plot  and  equal-value  contour  lines  depict  a  hypothetical  con¬ 
cave  function  f(x)  of  the  two-dimensional  variable  x.  The  “ stem  lines” 
indicate  the  locations  of  the  stationary  point  x  (black),  an  arbitrarily 
chosen  reference  point  xr  (blue),  and  the  points  x  (green)  that  satisfy 
the  constraint  x  =  Xxr  +  (1  —  A)x  for  A  =  0. 1,  0.2 . 0.9.  Over  the  con¬ 

tinuum  0  <  A  <  1,  the  locus  of  x  forms  a  one- dimensional  trajectory 
( a  subsegment  of  which  is  indicated  by  lower  green  line )  that  emanates 
outward  from  x  along  the  line  passing  through  x  and  xr,  but  in  the  di¬ 
rection  opposite  to  xr.  The  upper  ends  of  the  stem  lines  correspond  to 
the  function  values  f(x),  /(xr),  and  /(5c).  Given  particular  values  of  x 
and  xr,  x  is  completely  determined  by  X,  such  that  f(x)  can  be  expressed 
as  /(A)  (upper  green  line),  which  follows  the  curvature  of  f(x)  along  the 
one- dimensional  trajectory.  By  varying  xr,  this  trajectory  can  be  made 
to  extend  from  x  in  every  possible  direction.  Thus,  if  /(A)  is  necessarily 
concave  with  respect  to  X,  regardless  of  the  value  of  xr,  then  the  overall 
function  is  radially  concave  and  the  stationary  point  is  necessarily  the 
unique  maximum. 
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away  from  the  stationary  point,  then  the  stationary  point  must  be  the  unique  maximum. 
This  approach  reduces  the  maximality  test  to  evaluating  a  single  second  derivative  with 
respect  to  a  scalar  variable. 

To  set  up  the  concavity  test,  each  set  of  parameter  values  in  0  —  {/x.  T}  is  viewed 
as  a  “point”  in  the  domain  of  the  objective  function  (i.e.,  parameter  space).  Prom  any 
such  point,  a  family  of  radial  trajectories  is  obtained  by  defining  a  “reference  point” 
0r  —  {nr,  Tr}  and  a  set  of  “locus  points”  0  =  |/2,  f  |  that  satisfy 

/x  =  A/zr  +  (1  —  A)/2,  (176) 

r  =  ATr  +  (l  -  A)f  ,  (177) 


where  A  is  a  scalar  variable  satisfying  0  <  A  <  1.  Substitution  of  these  expressions  into 
the  objective  function  generates  a  modified  objective  function  over  a  restricted  parameter 
space.  For  a  given  0r ,  the  values  of  0  fall  along  a  radial  line  emanating  from  (but  not 
including)  0 ,  in  the  direction  opposite  0T.  The  direction  of  0  from  0  is  thus  a  function 
of  the  reference  point  location.  The  distance  of  0  from  0  varies  as  a  function  of  A,  with 
a  scale  factor  equal  to  the  distance  from  0  to  0r.  The  reference  point  0r  is  treated  as 
an  implicit  parameter  that  is  arbitrarily  chosen  but  is  a  fixed  constant  once  it  is  chosen. 
The  modified  function  is  then  defined  as 


Q(X.$)  =  Q( A#,  +  (l-A)»)  , 


(178) 


where  the  short-hand  notation  0  =  X0r  +  (1  —  X)0  is  used  to  indicate  that  equa¬ 
tions  (176)  and  (177)  are  satisfied.  To  show  that  the  objective  function  is  concave, 
the  modified  objective  function  is  twice  differentiated  with  respect  to  A.  The  resulting 
second  derivative  is  evaluated  at  the  stationary  point  0  =  0.  where  it  is  shown  to  be 
manifestly  negative.  The  information  term  r]i(0)  and  error  term  r/ j(0)  are  again  treated 
individually.  Both  terms  contribute  negative  values  to  the  overall  second  derivative. 

The  first  component  is  expressed  in  terms  of  A  by  substituting  equation  (177)  into 
equation  (160),  which  gives 


m(\0) 


K 


log 


AIY  +  (1-A)f 


(179) 


When  evaluating  this  determinant,  it  is  helpful  to  observe  that  the  constraint  on  the 
information  matrices  in  equation  (177)  is  satisfied  for  all  A,  which  implies  that  T,  Tr, 
and  f  are  all  diagonalized  by  the  same  set  of  eigenvectors.  Therefore,  there  exists  an 
orthogonal  matrix  U  for  which 


UTUt'  =  D  =  U  {ATr  +  (l  —  A)  f  |  UT  =  ADr  +  (1  -  A)  D  ,  (180) 
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where  D  =  diag(di, . . . ,  dm),  Dr  =  diag(dr,i, . . . ,  drjA/),  and  D  =  diag(di, . . . ,  dm)  contain 
the  eigenvalues  of  T,  Tr,  and  T,  respectively.  The  first  component  of  the  modified 
objective  function  is  thus  given  by 


K 


=  2  los 


M 

ADr  +  (1  —  A)  D  j  =  ^^log{Adrl  +  (l-A)d;},  (181) 

1=1 


which  is  differentiated  twice  with  respect  to  A  to  obtain 

2 


d2 

dx2 


M 


(A,0)  =  -k 


I  dri  cfj  J 


i=i  ^  A  dri  +  (1  —  A)  d{ 


M 

T2  =  ~K  T?  ri  ~  ' 

) 


(182) 


The  numerator  in  the  summand  of  equation  (182)  is  strictly  positive  since  dri  and  di 
cannot  be  equal.  In  general,  the  denominator  is  only  non-negative  (i.e.,  T  will  have 
zero- valued  eigenvalues  somewhere ),  but  d2  is  strictly  positive  at  the  stationary  point, 
as  long  as  the  number  of  independent  “measurements”  is  larger  than  the  dimension  of 
the  measurement  space.  Finally,  since  k  is  also  a  positive  number,  the  overall  expression 
for  the  second  derivative  is  therefore  negative,  regardless  of  the  value  of  Tr. 

Turning  now  to  the  quadratic-form  component,  the  manipulations  involved  with 
evaluating  r}2(0)  are  eased  by  defining  the  intermediate  variables 


such  that 


6  =  fi-  Hr,  (183) 

A  =  f  -  Tr ,  (184) 

y  (  =  we-  p.,  (185) 

r  =  A  Tr  +  (1  —  A)f  =  f-AA,  (186) 

cot  —  n  =  oji  —  A  \xT  —  (1  —  \)p,  =  y^  +  A(S.  (187) 


The  modified  function  is  evaluated  by  substituting  equations  (176)  and  (177)  into  equa¬ 
tion  (162)  and  collecting  terms  in  A  to  obtain 

ff2{X  d)  =  *  ^2  Pe  fa  tr  {  f  (in  +  f  yn  y,!  +  A  (^f  y„  d  -  A  Cln  -  Ay„y^j 

+  A2  (f  <5  <5T  -  2A  yn<5T)  -  A3  (A  <5  ST)  j  .  (188) 


D-7 


Differentiating  twice  with  respect  to  A  and  performing  some  algebra  then  gives 

^7/2(A.0)  =  k <5T T <5  —  2  A <5T  A  | K-/Z)|  .  (189) 

At  the  stationary  point,  the  mean  vector  must  satisfy  Y^=\  Pt4>i  (w*  —  A)  =  0,  such 
that  the  term  in  braces  in  equation  (189)  is  zero.  Combining  these  results  with  equa¬ 
tion  (182)  gives 

—  Q(\.d)  =  -KY/^(dri-dl)  -  ndTt 6  <  0 ,  (190) 

where  51  T<5  is  necessarily  positive  because  T  is  positive  definite.  This  expression  is 
independent  of  A  and  its  sign  is  independent  of  the  value  of  0r.  The  original  objective 
function  therefore  must  be  concave,  and  the  stationary  point  is  the  unique  maximum. 
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APPENDIX  E 

GAUSSIAN  MOMENTS  IN  AN  INTERVAL 


When  implementing  histogram-based  estimation  methods,  it  is  necessary  to  evalu¬ 
ate  the  bin  probabilities  and  the  first  two  local  moments  in  each  bin,  which  are  developed 
in  this  appendix  for  the  scalar  Gaussian  density 


j>(x;>i.a2)  =  (27r<72)" 1  2  exp  (x  —  /()2  j  . 


(191) 


The  integral  expressions  required  for  these  probabilities  are  not  very  exciting  or  inter¬ 
esting,  but  they  are  quite  useful.  In  what  follows,  the  bin-probability  and  local-moment 
computations  are  outlined  for  a  “typical”  bin  interval  I  =  [a,  b). 


Bin  Probability.  The  bin  probability  is  given  by  the  integral 

rb 


which  is  evaluated  by  performing  the  change  of  variables 
1 


y  = 


V2^ 


(x  -  //) 


=  \/2rr2  y  + 


y  +  fi 


dx  =  V 2rr2  dy 


to  obtain 

0(7,  p,( t2)  =  -^=  [  e~y2  dy  , 

V*  Ja 

where  the  modified  integration  boundaries  are 


(■liven  a  computational  routine  for  the  error  function 

2  fu  ■> 

erf(u)  =  — ■=.  /  e  x  dx , 

V*  Jo 

the  bin  probability  is  computed  as 


(192) 


(193) 


(194) 


(195) 


(196) 


(197) 


(198) 
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First  Moment.  The  first  moment  in  the  interval  is 


u(I,n,a2)  =  y==  x  exP  |  —  (x  —  //)2 1  dx .  (199) 

Adding  and  subtracting  //  JQ6  e~^x~^2^2a2^dx  yields 

u(I,  At,  a2)  =  d>(/,  n,  v2)  +  /t  4>(I,  At,  a2) ,  (200) 

where  u(I,y,,a2)  is  the  first  central  local  moment 

u(I,n,a2)  =  -/==  (x~v)  exp{_2^2  (x_^)2}  dx  ■  (201) 

This  central  moment  is  evaluated  by  performing  the  change  of  variables 

V  =  ^2  =>  (x  ~  v)  =  y1  2  =>.  dx  =  ±-^=  y~1/2dy  (202) 

to  obtain 


CU 


(/,//,  a2)  = 


a 


\[2^ 

The  first  moment  is  therefore 


uj(I,y.a2)  =  -^=  (e  °2  -  e  a2)  +  y<j)(Ly.a2) . 


(203) 


(204) 


Second  Moment.  The  second  moment  in  the  interval  is 

i}{I,y,a2)  =  ^=5  J  exp  (x~At)2J  dx.  (205) 

Completing  the  square  in  the  moment  variable  yields 

Q(/,At,  a2)  =  Cl(I,y,,a2) +  2y,u(I,n,a2)  —  y,2  <j)(I,y,,a2) ,  (206) 

where  fl(I,fi,a2)  is  the  second  central  local  moment 

£1(1,  a2)  =  -^==  jT  (x  At)2  exp|-^(x -//)2|  dx.  (207) 

This  expression  is  simplified  using  the  change  of  variables  defined  in  equation  (193)  to 
obtain 

1  f P  /  \  o—2 

n(I,fi.a2)  =  -J= _ _  J  (2a2  y2)  e~y 2  (VZr2  dy)  =  —  J  y2  e~y2  dy ,  (208) 


E-2 


where  a  and  f3  are  defined  in  equations  (195)  and  (196),  respectively.  Using  the 
integration-by-parts  formula  Judv  =  uv  —  Jvdu  with  dv  =  e~y2dy  and  u  =  y2 
(such  that  v  =  —e~y2/(2y)  and  du  =  2 ydy),  the  second  central  moment  is 


U(/,/qa2) 


2a2 
a 2_ 

V* 


|ae  0,2  —  (3  e  ^  +  2 rr2  0(7,  y,  a2) . 


(209) 


The  overall  second  moment  is  therefore  given  by 


il(Ly,a2)  =  (ae  °2  -  0e  ^  +  2a2  0(7,  y,  a2)  +  2yu)(L  y,  a2)  -  y2  0(7,  y,  rr2) . 


Substituting  the  definition  of  u;(7,//,cr2)  given  in  equation  (204),  collecting  like  terms, 
and  defining  the  function 


(210) 


gives  the  second  moment  as 

U(7,  y, rr2)  =  c(q)  e~a2  -  c(f3)  e~ 02  +  (2rr2  +  y2)  0(7,  y,  a2) . 


(211) 


Underflow  Issues.  The  expressions  given  above  for  the  bin  probabilities  and  mo¬ 
ments  become  unstable  when  interval  7  is  far  removed  from  the  mean  y.  In  these 
regions,  underflow  issues  become  a  dominating  factor.  This  difficulty  is  easily  avoided, 
however,  by  restricting  the  range  of  integration  to  those  bins  that  lie  within,  say,  ±8 
standard  deviations  of  the  mean,  effectively  treating  the  remaining  bins  as  a  “set  of 
measure  zero.” 
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